1 INTRODUCTION: FRAMINGHAM DATA SUBSET

The Framingham Heart Study, initiated in 1948, was designed to identify the factors and characteristics that contribute to the development of cardiovascular disease. Based in Massachusetts, the study has followed multiple generations of participants using a longitudinal design that includes regular examinations and health assessments. Participants undergo comprehensive medical evaluations, laboratory testing, and lifestyle assessments. The study captures a wide range of risk factors, including both behavioral and biological components.

This analysis focuses on a subset of data from the Framingham Heart Study. Specifically, it investigates how selected biological and behavioral risk factors predict the likelihood of developing coronary heart disease (CHD) within a ten-year period.

1.1 Description of Data

The original Framingham Heart Study includes data from approximately 15,000 participants and over 1,000 variables. For this analysis, a subset of the data is used, containing 4,240 observations and 16 variables.

This analysis focuses specifically on biological features that directly contribute to health outcomes. Prior analysis (https://jmusa3.github.io/jmusa/project/Project_2.html) indicated that education was not a significant predictor of either Ten-Year Coronary Heart Disease (CHD) risk or total cholesterol levels, supporting its removal from the feature set. As such, the variable “Education” has been excluded from the dataset.

The original feature “male” is renamed “sex” for ease of comprehension. After excluding “Education,” the dataset for this analysis consists of 15 variables (14 predictors plus the response):

•   7 factor variables (categorical)
•   8 numeric variables (continuous)

The names and definitions of these variables are provided below.

  1. sex (factor): sex of the participant: female (0), male (1).
  2. age (numerical): age at the time of the medical examination, in years.
  3. currentSmoker (factor): current cigarette smoking at the time of the examination.
  4. cigsPerDay (numerical): number of cigarettes smoked per day.
  5. BPMeds (factor): use of anti-hypertensive medication at the exam.
  6. prevalentStroke (factor): prevalent stroke (0 = free of disease).
  7. prevalentHyp (factor): prevalent hypertension; a subject was defined as hypertensive if treated.
  8. diabetes (factor): diabetic according to the criteria of the first exam.
  9. totChol (numerical): total cholesterol (mg/dL).
  10. sysBP (numerical): systolic blood pressure (mmHg).
  11. diaBP (numerical): diastolic blood pressure (mmHg).
  12. BMI (numerical): body mass index, weight (kg) / height (m)^2.
  13. heartRate (numerical): heart rate (beats/minute).
  14. glucose (numerical): blood glucose level (mg/dL).
  15. TenYearCHD (factor, response variable): the ten-year risk of coronary heart disease (CHD); 1 = yes, 0 = no.

The dataset has missing values for six variables: cigsPerDay, BPMeds, totChol, BMI, heartRate, and glucose.

# dplyr is used for the select/rename steps below
library(dplyr)

fhs <- read.csv("https://raw.githubusercontent.com/jmusa3/jmusa/refs/heads/main/FraminghamHeartStudy.csv")

fhs <- fhs %>% dplyr::select(-education)

# Rename 'male' to 'sex', where 1 = male and 0 = female
fhs <- fhs %>%
  rename(sex = male)
fhs$sex <- factor(fhs$sex)
fhs$BPMeds <- factor(fhs$BPMeds)
fhs$currentSmoker <- factor(fhs$currentSmoker)
fhs$prevalentStroke <- factor(fhs$prevalentStroke)
fhs$prevalentHyp <- factor(fhs$prevalentHyp)
fhs$diabetes <- factor(fhs$diabetes)

summary(fhs)
 sex           age        currentSmoker   cigsPerDay      BPMeds    
 0:2420   Min.   :32.00   0:2145        Min.   : 0.000   0   :4063  
 1:1820   1st Qu.:42.00   1:2095        1st Qu.: 0.000   1   : 124  
          Median :49.00                 Median : 0.000   NA's:  53  
          Mean   :49.58                 Mean   : 9.006              
          3rd Qu.:56.00                 3rd Qu.:20.000              
          Max.   :70.00                 Max.   :70.000              
                                        NA's   :29                  
 prevalentStroke prevalentHyp diabetes    totChol          sysBP      
 0:4215          0:2923       0:4131   Min.   :107.0   Min.   : 83.5  
 1:  25          1:1317       1: 109   1st Qu.:206.0   1st Qu.:117.0  
                                       Median :234.0   Median :128.0  
                                       Mean   :236.7   Mean   :132.4  
                                       3rd Qu.:263.0   3rd Qu.:144.0  
                                       Max.   :696.0   Max.   :295.0  
                                       NA's   :50                     
     diaBP            BMI          heartRate         glucose      
 Min.   : 48.0   Min.   :15.54   Min.   : 44.00   Min.   : 40.00  
 1st Qu.: 75.0   1st Qu.:23.07   1st Qu.: 68.00   1st Qu.: 71.00  
 Median : 82.0   Median :25.40   Median : 75.00   Median : 78.00  
 Mean   : 82.9   Mean   :25.80   Mean   : 75.88   Mean   : 81.96  
 3rd Qu.: 90.0   3rd Qu.:28.04   3rd Qu.: 83.00   3rd Qu.: 87.00  
 Max.   :142.5   Max.   :56.80   Max.   :143.00   Max.   :394.00  
                 NA's   :19      NA's   :1        NA's   :388     
   TenYearCHD    
 Min.   :0.0000  
 1st Qu.:0.0000  
 Median :0.0000  
 Mean   :0.1519  
 3rd Qu.:0.0000  
 Max.   :1.0000  
                 

Table 1. Summary statistics for the 15 features in the Framingham Heart Study subset. Six of the variables have missing values.

2 EDA

2.1 Single Features

In the raw data, all of the features are typed as numeric, so the indicator variables are converted to factors. The feature “male” is renamed “sex,” and for visualization the 0/1 indicator codes are mapped to descriptive labels to improve the readability of the graphs.
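The relabeling step can be sketched as follows. This is a minimal illustration using a plotting copy of the data (the name fhs_plot is chosen here for illustration), so the original 0/1 coding in fhs stays intact for modeling:

```r
# Work on a copy so the 0/1 coding in fhs is preserved for modeling
fhs_plot <- fhs

# Map the 0/1 indicator codes to descriptive labels for the graphs;
# the same pattern applies to the other indicator variables
fhs_plot$sex <- factor(fhs_plot$sex, levels = c(0, 1),
                       labels = c("Female", "Male"))
fhs_plot$currentSmoker <- factor(fhs_plot$currentSmoker, levels = c(0, 1),
                                 labels = c("Non-smoker", "Smoker"))
```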

# Set up plotting area
par(mfrow = c(2, 2))

# Histogram for age
hist(fhs$age,
     main = "Age",
     xlab = "Age",
     col = "steelblue",
     border = "white")

# Histogram for cigsPerDay
hist(fhs$cigsPerDay,
     main = "Cigarettes Per Day",
     xlab = "Cigarettes Per Day",
     col = "steelblue",
     border = "white")

# Histogram for totChol
hist(fhs$totChol,
     main = "Total Cholesterol",
     xlab = "Total Cholesterol",
     col = "steelblue",
     border = "white")

# Histogram for sysBP
hist(fhs$sysBP,
     main = "Systolic Blood Pressure",
     xlab = "Systolic BP",
     col = "steelblue",
     border = "white")

# Reset plotting layout to default
par(mfrow = c(1, 1))

Figure A. Histograms of age, cigsPerDay, totChol (response), and sysBP.

# Set up 2x2 plotting area
par(mfrow = c(2, 2))

# Histogram for diastolic blood pressure
hist(fhs$diaBP,
     main = "Diastolic Blood Pressure",
     xlab = "Diastolic BP",
     col = "steelblue",
     border = "white")

# Histogram for BMI
hist(fhs$BMI,
     main = "Body Mass Index (BMI)",
     xlab = "BMI",
     col = "steelblue",
     border = "white")

# Histogram for heart rate
hist(fhs$heartRate,
     main = "Heart Rate",
     xlab = "Beats Per Minute",
     col = "steelblue",
     border = "white")

# Histogram for glucose
hist(fhs$glucose,
     main = "Glucose Level",
     xlab = "Glucose",
     col = "steelblue",
     border = "white")

# Reset to default plotting layout
par(mfrow = c(1, 1))

Figure B. Histograms for diaBP, BMI, heartRate, and glucose.

# Set up 2x2 plotting layout
par(mfrow = c(2, 2))


# Bar plot for sex
barplot(table(fhs$sex),
        main = "Sex Distribution",
        col = "steelblue",
        names.arg = c("Female", "Male"),
        ylab = "Count")



# Bar plot for current smoker
barplot(table(fhs$currentSmoker),
        main = "Current Smoker Status",
        col = "steelblue",
        names.arg = c("No", "Yes"),
        ylab = "Count")

# Bar plot for BP meds
barplot(table(fhs$BPMeds),
        main = "Blood Pressure Medication",
        col = "steelblue",
        names.arg = c("No", "Yes"),
        ylab = "Count")

Figure C. Bar charts for sex, currentSmoker and BPMeds.

# Set 2x2 layout for bar charts
par(mfrow = c(2, 2))

# Bar chart for prevalentStroke
barplot(table(fhs$prevalentStroke),
        main = "Prevalent Stroke",
        xlab = "0 = No, 1 = Yes",
        col = "steelblue")

# Bar chart for prevalentHyp
barplot(table(fhs$prevalentHyp),
        main = "Prevalent Hypertension",
        xlab = "0 = No, 1 = Yes",
        col = "steelblue")

# Bar chart for diabetes
barplot(table(fhs$diabetes),
        main = "Diabetes",
        xlab = "0 = No, 1 = Yes",
        col = "steelblue")

# Bar chart for TenYearCHD
barplot(table(fhs$TenYearCHD),
        main = "Ten-Year CHD",
        xlab = "0 = No, 1 = Yes",
        col = "steelblue")

Figure D. Bar charts for prevalentStroke, prevalentHyp, diabetes, and TenYearCHD (response).

3 MISSING VALUES

Imputation is a critical step in data preprocessing, especially when working with datasets containing missing values. It involves estimating and replacing missing data with plausible values based on the available information.

For the missing cigsPerDay values, instances that are not current smokers (currentSmoker = 0) are imputed with zero. All remaining instances with missing cigsPerDay are imputed with the mean number of cigarettes among current smokers.

# Step 1: Save the original cigsPerDay before imputation
cigs_before <- fhs$cigsPerDay

# Step 2: Impute using currentSmoker
mean_cigs_smokers <- mean(fhs$cigsPerDay[fhs$currentSmoker == 1 & !is.na(fhs$cigsPerDay)], na.rm = TRUE)
fhs$cigsPerDay[is.na(fhs$cigsPerDay) & fhs$currentSmoker == 0] <- 0
fhs$cigsPerDay[is.na(fhs$cigsPerDay) & fhs$currentSmoker == 1] <- mean_cigs_smokers

# Step 3: Create a combined data frame for plotting
cigs_df <- rbind(
  data.frame(value = cigs_before, group = "Before"),
  data.frame(value = fhs$cigsPerDay, group = "After")
)

# Step 4: Plot the overlay density curve
ggplot(cigs_df, aes(x = value, fill = group, color = group)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Plot of cigsPerDay Before and After Imputation",
       x = "Cigarettes Per Day", y = "Density") +
  scale_fill_manual(values = c("Before" = "lightblue", "After" = "salmon")) +
  scale_color_manual(values = c("Before" = "lightblue", "After" = "salmon")) +
  theme_minimal()

Figure E. Post-imputation, cigsPerDay has zero missing values. The before- and after-imputation distributions are very similar.

The distributions of cigsPerDay before and after imputation are very similar, indicating that the imputation process effectively preserved the original data characteristics.

Multiple imputation by chained equations (MICE) is used to impute the remaining missing values for totChol, BMI, heartRate, and glucose (BPMeds is handled separately below). MICE is an iterative method that creates multiple imputed datasets, improving the reliability of the imputation process by incorporating uncertainty about the missing values. The imputed values provide a more complete dataset for subsequent modeling and prediction.

# Create a data frame of the features used in the imputation model
fhs_missing_before <- fhs[, c("cigsPerDay", "totChol", "BMI", "heartRate", "glucose")]

# Perform MICE imputation (predictive mean matching) while fully suppressing console output
library(mice)
invisible(capture.output({
  mice_imputation <- mice(fhs_missing_before, method = 'pmm', m = 5, maxit = 50, seed = 500)
}))

# Complete the imputed data
fhs_imputed <- complete(mice_imputation, action = 1)

# Prepare data for plotting (before and after)
fhs_before <- data.frame(value = as.vector(as.matrix(fhs_missing_before)),
                         variable = rep(colnames(fhs_missing_before), each = nrow(fhs_missing_before)),
                         dataset = rep("Before", nrow(fhs_missing_before) * ncol(fhs_missing_before)))

fhs_after <- data.frame(value = as.vector(as.matrix(fhs_imputed)),
                        variable = rep(colnames(fhs_imputed), each = nrow(fhs_imputed)),
                        dataset = rep("After", nrow(fhs_imputed) * ncol(fhs_imputed)))

# Combine the data for before and after into one data frame
fhs_combined <- rbind(fhs_before, fhs_after)

# Create the density plot with ggplot
ggplot(fhs_combined, aes(x = value, fill = dataset, color = dataset)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~variable, scales = "free") +
  labs(title = "Density Plots: Before and After Imputation", 
       x = "Value", y = "Density") +
  theme_minimal() +
  scale_fill_manual(values = c("Before" = "lightblue", "After" = "pink")) +
  scale_color_manual(values = c("Before" = "lightblue", "After" = "pink"))

Figure F. Before and after imputation density curves.

The distributions of the variables before and after imputation are very similar, indicating that the imputation process preserved the original data characteristics. The MICE-imputed values closely follow the patterns and spread of the observed data, maintaining the integrity of the variables for further analysis. The imputed values are therefore added to the final working dataset.

# Overwrite original columns in fhs with imputed values
fhs$cigsPerDay <- fhs_imputed$cigsPerDay
fhs$totChol <- fhs_imputed$totChol
fhs$BMI <- fhs_imputed$BMI
fhs$heartRate <- fhs_imputed$heartRate
fhs$glucose <- fhs_imputed$glucose

The variable BPMeds (whether the participant was on blood pressure medication) still contains missing values. To address this, a logistic regression model is fitted on the cases with observed BPMeds, using relevant predictors (age, sysBP, diaBP, cigsPerDay, and prevalentHyp). Predicted probabilities from this model are then used to impute the missing values, with a threshold of 0.5 classifying individuals as either taking or not taking blood pressure medication.

# Save original BPMeds with NAs before imputation
fhs$BPMeds_original <- fhs$BPMeds

# Keep rows where BPMeds is not missing
bpm_complete <- fhs[!is.na(fhs$BPMeds_original), ]

# Fit logistic regression model
bpm_model <- glm(BPMeds_original ~ age + sysBP + diaBP + cigsPerDay + prevalentHyp,
                 data = bpm_complete, family = binomial)

# Identify rows with missing BPMeds
bpm_missing <- fhs[is.na(fhs$BPMeds_original), ]

# Identify rows among bpm_missing that have complete predictors
predictable_rows <- complete.cases(bpm_missing[, c("age", "sysBP", "diaBP", "cigsPerDay", "prevalentHyp")])

# Predict only on rows we can predict
if (any(predictable_rows)) {
  predicted_probs <- predict(bpm_model, newdata = bpm_missing[predictable_rows, ], type = "response")
  
  # Get the original row numbers for the missing data
  missing_row_indices <- which(is.na(fhs$BPMeds_original))
  
  # Find indices that are predictable
  predictable_indices <- missing_row_indices[predictable_rows]
  
  # Impute predicted values into the correct rows
  fhs$BPMeds[predictable_indices] <- ifelse(predicted_probs >= 0.5, 1, 0)
}


# Make sure fhs$BPMeds and fhs$BPMeds_original are there

# Create a long-format data frame (convert factors to character
# so the before/after levels combine cleanly across R versions)
bpm_plot_data <- data.frame(
  value = c(as.character(fhs$BPMeds_original), as.character(fhs$BPMeds)),
  status = rep(c("Before Imputation", "After Imputation"), each = nrow(fhs))
)

# Clean bar plot
ggplot(bpm_plot_data, aes(x = factor(value), fill = status)) +
  geom_bar(position = "dodge") +
  labs(title = "BPMeds Before and After Imputation",
       x = "BPMeds (0 = No, 1 = Yes)",
       y = "Count",
       fill = "Status") +
  scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
  theme_minimal(base_size = 14)

Figure G. Side-by-side bar chart of BPMeds before and after logistic-regression imputation.

The distribution of BPMeds before and after logistic imputation is very similar, indicating that the imputation process preserved the underlying structure of the data without introducing substantial bias.
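As a quick sanity check (a supplementary step, not part of the original workflow), the remaining missing-value counts can be confirmed directly:

```r
# Confirm that BPMeds no longer has missing values,
# and review NA counts across all columns
sum(is.na(fhs$BPMeds))   # expected to be 0 after imputation
colSums(is.na(fhs))
```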

# Remove BPMeds_original column after imputation and plotting
fhs$BPMeds_original <- NULL

4 FEATURE SELECTION

4.1 Feature Selection

To reduce the number of variables included in pairwise relationship plots, feature selection was performed using a Random Forest regression model. The goal is to identify the most important numerical predictors of TenYearCHD, so that only the most relevant variables are examined in scatterplots and other visualizations.

# Subset numeric predictors
numeric_vars <- sapply(fhs, is.numeric)
fhs_numeric <- fhs[, numeric_vars]

# Make sure TenYearCHD is included
fhs_numeric$TenYearCHD <- fhs$TenYearCHD  

# Random Forest for numeric features
library(randomForest)
rf_numeric <- randomForest(TenYearCHD ~ ., data = fhs_numeric, importance = TRUE)

# Extract importance
importance_numeric <- importance(rf_numeric)

# Create importance dataframe based on IncNodePurity
importance_numeric_df <- data.frame(
  Variable = rownames(importance_numeric),
  Importance = importance_numeric[, "IncNodePurity"]
)

# Sort by importance
importance_numeric_df <- importance_numeric_df[order(-importance_numeric_df$Importance), ]

# View results
print(importance_numeric_df)
             Variable Importance
sysBP           sysBP   71.20197
BMI               BMI   68.36096
totChol       totChol   64.73109
glucose       glucose   62.65319
age               age   61.73497
diaBP           diaBP   60.80589
heartRate   heartRate   49.53097
cigsPerDay cigsPerDay   28.65519

Table 2. Most important numeric features.

Using a Random Forest regression model, feature importance was evaluated for the numerical predictors of TenYearCHD. The results indicate that systolic blood pressure (sysBP), body mass index (BMI), and total cholesterol (totChol) are the most influential variables in predicting ten-year coronary heart disease risk. Other notable predictors include glucose, age, diastolic blood pressure (diaBP), and heart rate, while cigarettes per day (cigsPerDay) had the lowest relative importance among the numerical features considered.

To identify the most important categorical predictors of TenYearCHD, a Random Forest classification model was fit using only the categorical variables in the dataset. This feature selection step helps focus pairwise analyses on the most relevant categorical factors, improving the clarity and interpretability of the visualizations.

# Subset only categorical variables
factor_vars <- sapply(fhs, is.factor)
fhs_factor <- fhs[, factor_vars]



# Remove BPMeds_original if it exists
fhs_factor$BPMeds_original <- NULL

# Ensure the target variable TenYearCHD is a factor in the original data
fhs$TenYearCHD <- as.factor(fhs$TenYearCHD)

# Add TenYearCHD back to the factor dataset if it was dropped
fhs_factor$TenYearCHD <- fhs$TenYearCHD

# Train Random Forest classifier on categorical features
rf_factor <- randomForest(TenYearCHD ~ ., data = fhs_factor, importance = TRUE)

# Get importance values
importance_factor <- importance(rf_factor)

# Convert to a clean data frame
importance_factor_df <- data.frame(
  Variable = rownames(importance_factor),
  Importance = importance_factor[, "MeanDecreaseGini"]
)

# Sort by importance
importance_factor_df <- importance_factor_df[order(-importance_factor_df$Importance), ]

# Print
print(importance_factor_df)
                       Variable Importance
prevalentHyp       prevalentHyp  27.738833
diabetes               diabetes   8.203009
sex                         sex   7.644328
BPMeds                   BPMeds   5.466221
prevalentStroke prevalentStroke   4.588553
currentSmoker     currentSmoker   2.841952

Table 3. Most important categorical features.

Using a Random Forest classification model, feature importance was evaluated for the categorical predictors of TenYearCHD. The results show that prevalent hypertension (prevalentHyp) is by far the most influential categorical variable, followed by diabetes and sex. Other categorical features such as prevalent stroke, use of blood pressure medication (BPMeds), and current smoking status had lower relative importance in predicting ten-year coronary heart disease risk.

5 PAIRWISE RELATIONSHIPS

To explore the relationships between the most important numerical predictors of TenYearCHD, pairwise scatterplots are generated for the top five variables identified through Random Forest feature selection. These include systolic blood pressure (sysBP), body mass index (BMI), total cholesterol (totChol), glucose, and age. Analyzing these pairwise relationships helps to visually understand how these features correlate with each other and with the target variable, TenYearCHD.

# Select the top 5 most important numerical features
top_features <- fhs[, c("sysBP", "BMI", "totChol", "glucose", "age")]

# Compute the correlation matrix
cor_matrix <- cor(top_features)

# Print the correlation matrix with numbers
#print(cor_matrix)

# Create a clean correlation plot with numbers displayed inside the cells
corrplot(cor_matrix, 
         method = "color",         # Color-based visualization
         type = "full",            # Full matrix display
         col = colorRampPalette(c("red", "white", "blue"))(200), # Color gradient
         tl.cex = 0.8,             # Adjust label size
         number.cex = 0.8,         # Adjust correlation number size
         addCoef.col = "black",    # Add correlation coefficients in black
         title = "Correlation Matrix for Top 5 Numerical Features", 
         mar = c(0,0,1,0))         # Margins around the plot

Figure H. Correlation matrix for continuous features.

The scatterplot matrix below displays pairwise relationships between the five most important numerical features identified for predicting coronary heart disease: systolic blood pressure (sysBP), body mass index (BMI), total cholesterol (totChol), glucose, and age. Each plot shows the joint distribution of data points for one pair of features. To reduce clutter, a random 30% subsample of the data (drawn without replacement) is plotted, allowing clearer insight into potential patterns and associations.

# Load necessary libraries
library(ggplot2)
library(gridExtra)

# Select the top 5 most important numerical features
top_features <- fhs[, c("sysBP", "BMI", "totChol", "glucose", "age")]

# Subsample the data (a random 30% without replacement) to declutter the scatterplots
set.seed(123)  # For reproducibility
top_features_sample <- top_features[sample(1:nrow(top_features), 0.3 * nrow(top_features)), ]

# Create the scatterplot matrix manually
pairs_list <- list()

# Loop over all pairs of features to create individual scatterplots
for (i in 1:(ncol(top_features_sample) - 1)) {
  for (j in (i + 1):ncol(top_features_sample)) {
    x_name <- names(top_features_sample)[i]
    y_name <- names(top_features_sample)[j]
    # .data pronoun replaces the deprecated aes_string()
    pairs_list[[paste(x_name, y_name, sep = "_")]] <-
      ggplot(top_features_sample, aes(x = .data[[x_name]], y = .data[[y_name]])) +
      geom_point(alpha = 0.3, size = 0.5) +
      labs(title = paste(x_name, "vs", y_name), x = x_name, y = y_name) +
      theme_minimal() +
      theme(
        plot.title = element_text(size = 10, face = "bold", hjust = 0.5, vjust = 0),
        axis.title = element_text(size = 8),
        axis.text = element_text(size = 7)
      )
  }
}

# Arrange all scatterplots in a grid layout
grid.arrange(grobs = pairs_list, ncol = 5)  # Arrange in 5 columns, adjust ncol if needed

Figure I. Scatterplots of the five most important numeric features.

The pairwise scatterplots reveal several moderate linear relationships among the top features. Specifically, systolic blood pressure (sysBP) and body mass index (BMI) exhibit a moderate linear relationship, as do sysBP and total cholesterol (totChol), BMI and totChol, sysBP and age, and totChol and age. These relationships suggest that as one variable increases, there tends to be a corresponding increase in the other, indicating a moderate association between these key health factors.

# Set up the plotting area to have 2 rows and 3 columns
par(mfrow = c(2, 3))

# List of categorical variables to compare with TenYearCHD
categorical_vars <- c("sex", "currentSmoker", "BPMeds", "prevalentStroke", "prevalentHyp", "diabetes")

# Loop through each categorical variable
for (var in categorical_vars) {
  # Mosaic plot of each variable against TenYearCHD (0 = No, 1 = Yes)
  mosaicplot(table(fhs[[var]], fhs$TenYearCHD),
             main = paste("TenYearCHD vs", var),
             color = c("lightblue", "pink"),
             xlab = var, ylab = "TenYearCHD (0 = No, 1 = Yes)",
             las = 1)
}

Figure J. Mosaic plots of relationships of categorical features.

The mosaic plots suggest an association between TenYearCHD and all of the categorical variables except “currentSmoker.” Specifically, “sex,” “BPMeds,” “prevalentStroke,” “prevalentHyp,” and “diabetes” show clear patterns where the distribution of TenYearCHD differs across their levels. “currentSmoker,” however, does not appear to exhibit a strong association with the outcome, suggesting that smoking status on its own may not be a strong predictor of TenYearCHD in this dataset.
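The visual impressions from the mosaic plots can be checked formally with chi-square tests of independence. This is a supplementary check, not part of the original workflow; note that chisq.test may warn when expected cell counts are small (e.g., for prevalentStroke), in which case fisher.test is an alternative:

```r
# Chi-square test of association between each categorical feature and TenYearCHD
categorical_vars <- c("sex", "currentSmoker", "BPMeds",
                      "prevalentStroke", "prevalentHyp", "diabetes")
for (var in categorical_vars) {
  p_val <- chisq.test(table(fhs[[var]], fhs$TenYearCHD))$p.value
  cat(var, ": p =", signif(p_val, 3), "\n")
}
```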

6 FEATURE ENGINEERING

To enhance model performance and uncover meaningful patterns, feature engineering was applied to the Framingham Heart Study dataset. This process included creating new variables, such as clusters, interaction terms, and clinically relevant risk indicators, to better capture complex relationships with ten-year cardiovascular disease risk.

# Keep only the numeric columns (factors, including the response, are excluded)
fhs_numeric <- fhs[sapply(fhs, is.numeric)]

# Standardize the numeric features
fhs_scaled <- scale(fhs_numeric)

set.seed(123)

# Compute WSS for k = 1 to 10
wss <- sapply(1:10, function(k){
  kmeans(fhs_scaled, centers = k, nstart = 10)$tot.withinss
})

# Plot Elbow Curve
plot(1:10, wss, type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares",
     main = "Elbow Method for Determining Optimal K")

# Run k-means with k = 3, chosen from the elbow plot above
kmeans_result <- kmeans(fhs_scaled, centers = 3, nstart = 25)

# Add cluster to the dataset
fhs$Cluster <- as.factor(kmeans_result$cluster)

# Run PCA just for plotting (2D representation)
pca_result <- prcomp(fhs_scaled)

# Create data frame for plotting
plot_data <- data.frame(PC1 = pca_result$x[,1],
                        PC2 = pca_result$x[,2],
                        Cluster = fhs$Cluster)

# Plot the clusters
library(ggplot2)
ggplot(plot_data, aes(x = PC1, y = PC2, color = Cluster)) +
  geom_point(size = 2, alpha = 0.8) +
  labs(title = "K-Means Clustering (Visualized via PCA)",
       x = "Principal Component 1",
       y = "Principal Component 2") +
  theme_minimal() +
  scale_color_brewer(palette = "Set1")

Figure K. Visualization of clusters.

The cluster analysis revealed three well-separated and distinct groups within the Framingham Heart dataset, suggesting meaningful underlying patterns in the data. These clusters may represent different cardiovascular risk profiles, which could help inform targeted prevention or intervention strategies.
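The interaction terms and clinically motivated risk indicators mentioned at the start of this section can be sketched along the following lines. The variable names (highBP, smoke_age) and the 140/90 mmHg cutoffs are illustrative assumptions, not part of the original dataset:

```r
# Illustrative engineered features (names and cutoffs are assumptions)

# Hypertension indicator based on the common 140/90 mmHg thresholds
fhs$highBP <- factor(ifelse(fhs$sysBP >= 140 | fhs$diaBP >= 90, 1, 0))

# Interaction between smoking intensity and age
fhs$smoke_age <- fhs$cigsPerDay * fhs$age
```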

7 METHODOLOGY: Predictive Modeling Approaches

Previous methods, such as regression models (including regularized versions like Lasso and Ridge) and SVMs, were used to address the formulated questions, excelling in linear relationships and smaller datasets. However, these models can struggle with non-linear data or high-dimensional feature spaces.

CART, Bagging, and Random Forests offer advantages by capturing complex, non-linear relationships without requiring linearity assumptions. Bagging reduces variance by averaging many trees grown on bootstrap samples; Random Forests improve on Bagging by also selecting a random subset of predictors at each split, which decorrelates the trees and further reduces variance and overfitting. While a single CART tree is interpretable, it may overfit without proper pruning. Each family has its strengths: regression models and SVMs excel on smaller, simpler datasets with linear or regularized relationships, while CART, Bagging, and Random Forests shine on larger datasets with more complex interactions. The trade-off between interpretability and performance is a key consideration when selecting the appropriate model for a given task.
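The relationship between Bagging and Random Forests can be made concrete in code: with the randomForest package, Bagging is simply a forest whose mtry equals the number of predictors. This is an illustrative sketch, not a model fitted in this analysis:

```r
library(randomForest)

# Number of predictors (all columns except the response)
p <- ncol(fhs) - 1

# Bagging: every predictor is considered at every split (mtry = p)
bag_model <- randomForest(TenYearCHD ~ ., data = fhs, mtry = p)

# Random Forest: a random subset of predictors at each split
# (default mtry = floor(sqrt(p)) for classification)
rf_model <- randomForest(TenYearCHD ~ ., data = fhs)
```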

8 CART REGRESSION

CART (Classification and Regression Trees) regression is a decision tree algorithm used to predict continuous outcomes, such as total cholesterol (totChol) in the Framingham Heart Study dataset. The model development follows these key components:

1.  Splitting Criterion: The tree recursively partitions the data by choosing splits that minimize the sum of squared errors (SSE) within the resulting nodes, aiming to create homogeneous groups based on predictors like age, BMI, and blood pressure.
2.  Tree Structure and Stopping Rules: To prevent the tree from growing too complex, stopping rules such as minimum number of observations required to split a node (minsplit), minimum size of terminal nodes (minbucket), and maximum tree depth (maxdepth) are applied.
3.  Pruning: After the full tree is grown, cost-complexity pruning is used to avoid overfitting. This involves evaluating subtrees using cross-validation and selecting the optimal complexity parameter (cp) that balances model accuracy with simplicity.
4.  Interpretability: The final tree structure provides a transparent model showing how different combinations of predictors lead to different predicted cholesterol levels.

8.1 Initial Regression Tree

An initial regression tree was built to explore how various health and lifestyle factors predict total cholesterol levels in the Framingham dataset. By setting relatively flexible tree parameters, the model was able to identify key variables and potential nonlinear relationships through recursive partitioning.

# caret provides createDataPartition for stratified splitting
library(caret)

# Set seed for reproducibility
set.seed(123)

# Split the data into training (70%) and testing (30%) sets
trainIndex <- createDataPartition(fhs$TenYearCHD, p = 0.7, list = FALSE)
train_data <- fhs[trainIndex, ]
test.data <- fhs[-trainIndex, ]


# rpart fits the tree; rpart.plot draws it
library(rpart)
library(rpart.plot)

# Build the initial CART regression tree for predicting total cholesterol
tree_model <- rpart(
  totChol ~ ., 
  data = train_data,
  method = "anova",  # Regression tree
  control = rpart.control(
    minsplit = 10,         # Minimum number of observations required to attempt a split
    minbucket = 5,         # Minimum number of observations in terminal nodes
    cp = 0.002,             # Complexity parameter 
    maxdepth = 5           # Maximum depth of the tree
  )
)

# Visualize the unpruned regression tree
rpart.plot(tree_model, main = "Initial CART Regression Tree for Total Cholesterol")

Figure L. Initial regression tree before pruning.

9 HYPERPARAMETER TUNING

To improve model performance and prevent overfitting, we performed hyperparameter tuning on the CART regression model using 10-fold cross-validation. Specifically, we tuned the complexity parameter (cp), which controls the trade-off between tree size and predictive accuracy.

# Set seed for reproducibility
set.seed(123)

# Build the initial regression tree with cross-validation
tree_model <- rpart(
  totChol ~ ., 
  data = train_data,
  method = "anova",  # Regression tree
  control = rpart.control(
    minsplit = 10,
    minbucket = 5,
    cp = 0.01,
    maxdepth = 5,
    xval = 10
  )
)

# View the complexity parameter table
library(pander)
pander(tree_model$cptable)

      CP  nsplit  rel error  xerror     xstd
  0.1619       0     1       1.001   0.04969
 0.01281       1     0.8381  0.8389  0.04484
 0.01185       2     0.8253  0.841   0.04547
 0.01034       3     0.8134  0.8255  0.04547
 0.01          4     0.8031  0.819   0.0453
# Plot the complexity parameter vs. error (xerror)
plotcp(tree_model)

# Choose the optimal cp based on xerror
min_xerror <- min(tree_model$cptable[, "xerror"])
best_cp_row <- which.min(tree_model$cptable[, "xerror"])
best_cp <- tree_model$cptable[best_cp_row, "CP"]

# Print the best cp value
cat("Best CP: ", round(best_cp, 4), "\n")
Best CP:  0.01 

Table 4. Cross-validation results for different complexity parameters.

Figure M. Complexity parameter (cp) versus cross-validated error (xerror).

Although the complexity plot shows an elbow where further splits begin to yield diminishing returns, cross-validation identified cp = 0.01 as the value that minimizes the cross-validated error (xerror). Applying the 1-SE rule to the same table instead points to cp ≈ 0.012, the largest cp whose xerror falls within one standard error of the minimum; that value is used for the 1-SE pruning in the next subsection.

9.1 Pruned Regression Tree: Best CP (1-SE)

This code selects complexity parameters (cp) for pruning the regression tree based on both the minimum cross-validation error and the 1-SE rule. It then prunes the tree and makes predictions on the test set, saving the R-squared and RMSE values for later comparison across regression models.

# Step 1: Identify the minimum xerror and its corresponding cp (for pruning with minimum error)
cp.table <- tree_model$cptable
min.xerror <- min(cp.table[, "xerror"])
min.cp.row <- which.min(cp.table[, "xerror"])
min.cp <- cp.table[min.cp.row, "CP"]

# Step 2: Compute the standard error of the minimum xerror
xerror.std <- cp.table[min.cp.row, "xstd"]
threshold <- min.xerror + xerror.std

# Step 3: Manually set best.cp per the 1-SE rule (prunes to a simpler tree than min.cp)
best.cp <- 0.012  # approximately the largest cp whose xerror is within one SE of the minimum

# Step 4: Prune the tree using both min.cp and manually specified best.cp
pruned.tree.best.cp <- prune(tree_model, cp = best.cp)
pruned.tree.min.cp <- prune(tree_model, cp = min.cp)

# Step 5: Visualize the manually pruned tree
rpart.plot(pruned.tree.best.cp, 
           main = paste("Pruned Tree Best CP (1-SE): cp =", round(best.cp, 5)))

# Step 6: Make predictions on test data
pred.best.cp <- predict(pruned.tree.best.cp, newdata = test.data)
pred.min.cp  <- predict(pruned.tree.min.cp, newdata = test.data)

# Step 7: Save R-squared values
rsquared.bestcp <- cor(test.data$totChol, pred.best.cp)^2

# Save RMSE for best.cp (1-SE rule)
rmse.bestcp <- sqrt(mean((pred.best.cp - test.data$totChol)^2))

Figure N. Pruned Tree Best CP using 1-SE Rule.

9.2 Pruned Regression Tree: Minimum CP

The following tree represents the pruned regression model using the complexity parameter (CP) that minimizes cross-validation error. This version of the tree (Minimum CP) aims to balance model complexity and predictive accuracy by fitting as closely as possible to the training data without overfitting.

# Visualize the pruned tree using the minimum CP value
rpart.plot(pruned.tree.min.cp,
           main = paste("Pruned Regression Tree (Min CP): cp =", round(min.cp, 4)),
           type = 2,                # Label all nodes with predicted values
           extra = 101,             # Show fitted values and % of observations at each node
           under = TRUE,            # Display the split condition underneath the node
           faclen = 0,              # Avoid abbreviating factor levels
           cex = 0.7,               # Shrink text for better fit
           fallen.leaves = TRUE,    # Better layout for the leaves at the bottom
           box.palette = "GnBu",    # Green-blue color theme for boxes
           shadow.col = "gray")     # Add shadows to boxes for depth

# R-squared for min.cp
rsquared.mincp  <- cor(test.data$totChol, pred.min.cp)^2
# Save RMSE for min.cp (minimum xerror)
rmse.mincp <- sqrt(mean((pred.min.cp - test.data$totChol)^2))

Figure O. Pruned Regression Tree, Minimum CP.

The final CART regression model predicting total cholesterol identified sex and cigarettes per day as the most important predictors. This suggests that cholesterol levels in the Framingham dataset are most effectively segmented by sex and smoking behavior.

9.3 BAGGING CART Regression

To further improve predictive performance, we next apply Bagging (Bootstrap Aggregating) to CART regression. Bagging builds multiple trees on different bootstrap samples of the data and averages their predictions, helping to reduce variance and improve model stability compared to a single decision tree.
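The mechanics of bagging can be sketched by hand on toy data. This is a minimal illustration assuming only the rpart package; the data frame `toy` and the object names are hypothetical, not part of the analysis.

```r
library(rpart)
set.seed(1)

# Toy regression data standing in for the Framingham subset
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
toy$y <- 2 * toy$x1 - toy$x2 + rnorm(200)

B <- 25  # number of bootstrap trees
preds <- replicate(B, {
  idx <- sample(nrow(toy), replace = TRUE)               # bootstrap sample of the rows
  fit <- rpart(y ~ x1 + x2, data = toy[idx, ], method = "anova")
  predict(fit, newdata = toy)                            # each tree predicts all rows
})

bagged_pred <- rowMeans(preds)  # average the B trees' predictions
```

Averaging over bootstrap trees is exactly what ipred::bagging automates, with the added convenience of an out-of-bag error estimate.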

To optimize the Bagging CART regression model, we perform a systematic hyperparameter tuning process. We split the data into training and testing sets and explore different combinations of the number of bagged trees, complexity parameters (cp), and maximum tree depths, selecting the settings that minimize the out-of-bag (OOB) error reported by ipred::bagging (coob = TRUE).

# Split the data into training and testing sets
set.seed(123)
train.index <- createDataPartition(fhs$totChol, p = 0.8, list = FALSE)
train.data <- fhs[train.index, ]
test.data <- fhs[-train.index, ]

# Define parameter combinations to test
nbagg.values <- c(10, 25, 50)     # number of bagged trees
cp.values <- c(0.01, 0.05, 0.1)   # candidate cp values for rpart
maxdepth.values <- c(5, 10, 20)   # maximum depth of the candidate tree

# Create an empty data frame to store results
results <- data.frame()

# Model tuning loop: ipred::bagging reports the out-of-bag error directly
# when coob = TRUE, so no caret trainControl object is needed here
for (nbagg in nbagg.values) {
  for (cp in cp.values) {
    for (maxdepth in maxdepth.values) {
      set.seed(123)
      model <- bagging(
        totChol ~ .,
        data = train.data,
        nbagg = nbagg,
        coob = TRUE,
        control = rpart.control(cp = cp, 
                                maxdepth = maxdepth)
      )
      # Get OOB error from each iteration
      oob.error <- model$err
      # Store results
      results <- rbind(results, 
                       data.frame(nbagg = nbagg,
                                  cp = cp,
                                  maxdepth = maxdepth,
                                  oob.error = oob.error))
    }
  }
}

# Find the best combination that yields the minimum out-of-bag error
best.params <- results[which.min(results$oob.error), ]
pander(best.params)
     nbagg   cp     maxdepth   oob.error
19   50      0.01   5          39.09

Table 5. Best hyperparameter combination for Bagging CART regression, showing 50 trees, a complexity parameter (cp) of 0.01, and a maximum tree depth of 5, resulting in the lowest out-of-bag error (39.09).

The final Bagging CART model is fitted to the training data using the optimal hyperparameters from the tuning process. Predictions are then generated on the test set for later evaluation.

# Retrain the final model using the best parameters
set.seed(123)
final.bagging.model <- bagging(
  totChol ~ .,
  data = train.data,
  nbagg = best.params$nbagg,
  coob = TRUE,
  control = rpart.control(cp = best.params$cp, maxdepth = best.params$maxdepth)
)

# Make predictions on test set
pred.bagging <- predict(final.bagging.model, newdata = test.data)

# Calculate RMSE and R-squared
rmse.bagging <- sqrt(mean((pred.bagging - test.data$totChol)^2))

ss.res <- sum((pred.bagging - test.data$totChol)^2)
ss.tot <- sum((mean(train.data$totChol) - test.data$totChol)^2)
rsq.bagging <- 1 - (ss.res / ss.tot)

In the next step, the importance of each predictor in the final Bagging model is evaluated. This analysis helps identify the features that have the most significant influence on predicting the target variable, offering insights into the model’s decision-making process.

# caret's varImp() does not provide a method for ipred bagging objects,
# so a custom function is used to aggregate importance across trees

# Custom function to get variable importance from bagging model
get.bagging.importance <- function(model) {
  # Get all the trees from the bagging model
  trees <- model$mtrees
  
  # Initialize importance vector
  imp <- numeric(length(trees[[1]]$btree$variable.importance))
  names(imp) <- names(trees[[1]]$btree$variable.importance)
  
  # Sum importance across all trees
  for (tree in trees) {
    imp[names(tree$btree$variable.importance)] <- 
      imp[names(tree$btree$variable.importance)] + 
      tree$btree$variable.importance
  }
  
  # Average importance
  imp <- imp / length(trees)
  return(imp)
}

# Get variable importance from the final bagging model
importance.scores <- get.bagging.importance(final.bagging.model)

# Display variable importance
#print(importance.scores)


# Set custom margins: bottom, left, top, right
par(mar = c(5, 8, 4, 2))  # increases left margin for labels


# Sort and plot
importance.scores <- sort(importance.scores, decreasing = TRUE)
barplot(importance.scores, horiz = TRUE, las = 1,
        main = "Variable Importance - Bagging (ipred)",
        xlab = "Importance Score")

Figure P. The bar chart displays the relative importance of each predictor variable in the final Bagging model, with the variables ordered by their contribution to the model’s predictions.

9.4 Random Forest Regression

Random Forest regression is a powerful ensemble learning technique that aggregates predictions from multiple decision trees to improve accuracy and robustness. In the process, hyperparameters such as the number of trees, the number of variables tried at each split, and the minimum node size play a crucial role in determining the model’s performance. The next step involves tuning these hyperparameters to optimize the Random Forest model for predicting total cholesterol (totChol) using cross-validation and evaluating the resulting performance.

Hyperparameter tuning for the Random Forest model involves testing different combinations of parameters, including mtry, ntree, and nodesize, to identify the optimal configuration based on RMSE performance.

# Hyperparameter Tuning for Random Forest
tune_grid <- expand.grid(
  mtry = c(2, 4, 6, 8),
  ntree = c(100, 200),
  nodesize = c(1, 5)
)

tune_results <- list()

for (i in 1:nrow(tune_grid)) {
  rf_temp <- randomForest(
    totChol ~ .,
    data = train.data,
    mtry = tune_grid$mtry[i],
    ntree = tune_grid$ntree[i],
    nodesize = tune_grid$nodesize[i]
  )
  preds <- predict(rf_temp, newdata = test.data)
  rmse <- sqrt(mean((test.data$totChol - preds)^2))
  tune_results[[i]] <- c(
    rmse = rmse,
    mtry = tune_grid$mtry[i],
    ntree = tune_grid$ntree[i],
    nodesize = tune_grid$nodesize[i]
  )
}

# Create data frame and find best parameters
tune_df <- as.data.frame(do.call(rbind, tune_results))
best_params <- tune_df[which.min(tune_df$rmse), ]

After tuning the Random Forest model, a separate Bagging model is built by setting the number of predictors considered at each split equal to the total number of available predictors. This approach emphasizes reducing variance by fully growing each tree and averaging their predictions for more stable results.

set.seed(123)

# Bagging model: mtry = total number of predictors
bagging_model <- randomForest(
  totChol ~ ., 
  data = train.data, 
  mtry = ncol(train.data) - 1,  # Exclude target variable
  importance = TRUE
)

# Predict on test data
bagging_pred <- predict(bagging_model, newdata = test.data)

# Performance: RMSE
bagging_rmse <- sqrt(mean((test.data$totChol - bagging_pred)^2))
#print(paste("Bagging RMSE:", round(bagging_rmse, 2)))

Using the best hyperparameters, the final Random Forest model is trained on the full training set. Predictions are made on the test data, and model performance is assessed through RMSE and R-squared values.

# Final Random Forest Model with Best Params
#-------------------------------------------
final_rf <- randomForest(
  totChol ~ ., 
  data = train.data,
  mtry = best_params$mtry,
  ntree = best_params$ntree,
  nodesize = best_params$nodesize,
  importance = TRUE
)

# Predict on test data
rf_pred <- predict(final_rf, newdata = test.data)

# Calculate RMSE
rmse.rforest <- sqrt(mean((test.data$totChol - rf_pred)^2))

# Calculate R-squared
ss.res <- sum((test.data$totChol - rf_pred)^2)
ss.tot <- sum((mean(train.data$totChol) - test.data$totChol)^2)
rsq.rforest <- 1 - (ss.res / ss.tot)

print(final_rf)

Call:
 randomForest(formula = totChol ~ ., data = train.data, mtry = best_params$mtry,      ntree = best_params$ntree, nodesize = best_params$nodesize,      importance = TRUE) 
               Type of random forest: regression
                     Number of trees: 200
No. of variables tried at each split: 4

          Mean of squared residuals: 1501.703
                    % Var explained: 21.07

Table 6. Summary of the final Random Forest model, displaying the number of trees, the number of variables tried at each split, the out-of-bag (OOB) mean squared residual, and the percentage of variance explained.

Variable importance is evaluated for the final Random Forest model to identify the predictors that contribute most to the prediction of total cholesterol.

# Variable importance
importance(final_rf)
                   %IncMSE IncNodePurity
sex             15.1415565     168934.57
age             20.5671015     786271.21
currentSmoker    4.5871529      76396.55
cigsPerDay      11.5185748     329470.56
BPMeds           3.6936359      26805.30
prevalentStroke -0.9200363      11001.93
prevalentHyp     6.0569320      71403.75
diabetes        -1.7994060      31643.76
sysBP           10.0326148     692038.42
diaBP           11.4193775     639401.81
BMI              9.2685742     790493.00
heartRate        5.1588655     564034.29
glucose          0.4333843     626173.40
TenYearCHD       0.2739989      85944.16
Cluster         40.7558708     775366.72
varImpPlot(final_rf, main = "Random Forest Variable Importance")

Table 7. Variable importance table.

Figure Q. Variable importance plot.

The variable importance table presents two key metrics: %IncMSE, which measures the increase in mean squared error when a variable is randomly permuted, and IncNodePurity, which reflects the total decrease in node impurity due to splits on each variable. Higher values in either metric indicate greater influence on the model’s predictive accuracy.

The accompanying Random Forest Variable Importance Plot visually ranks the predictors based on these scores, providing an intuitive way to identify the most impactful features in the model’s performance.

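The %IncMSE idea can be reproduced by hand: permute one predictor, re-predict, and measure how much the mean squared error rises. The sketch below is illustrative only; the function and variable names are hypothetical, and a simple lm stands in for the forest.

```r
# Permutation importance: increase in MSE after shuffling one predictor
perm_importance <- function(model, data, outcome, var, n_rep = 10) {
  base_mse <- mean((data[[outcome]] - predict(model, data))^2)
  inc <- replicate(n_rep, {
    shuffled <- data
    shuffled[[var]] <- sample(shuffled[[var]])   # break the variable's signal
    mean((data[[outcome]] - predict(model, shuffled))^2) - base_mse
  })
  mean(inc)
}

set.seed(42)
d <- data.frame(x1 = rnorm(300), noise = rnorm(300))
d$y <- 3 * d$x1 + rnorm(300, sd = 0.5)
fit <- lm(y ~ x1 + noise, data = d)

perm_importance(fit, d, "y", "x1")      # large: x1 carries the signal
perm_importance(fit, d, "y", "noise")   # near zero: noise is irrelevant
```

A variable whose permutation barely changes the error, like prevalentStroke or diabetes in Table 7, contributes little to the model's predictions.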

9.5 Comparison: Model Performance

In this analysis, various regression models are evaluated to predict total cholesterol (totChol) using the Framingham Heart Study dataset. The models’ performance is assessed through key metrics such as Root Mean Squared Error (RMSE) and R-squared, with the goal of determining which model most accurately captures the underlying patterns in the data.
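The two metrics can be written out explicitly. These helper functions (names are illustrative, not from the analysis) mirror the formulas applied model by model in this section: RMSE is the square root of the mean squared prediction error, and R-squared is one minus the ratio of residual to total sums of squares around a baseline mean.

```r
# Root mean squared error: average prediction error in the outcome's units
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# R-squared: 1 - SS_res / SS_tot, with SS_tot taken around a baseline mean
r_squared <- function(actual, predicted, baseline = mean(actual)) {
  1 - sum((actual - predicted)^2) / sum((actual - baseline)^2)
}

rmse(c(1, 2, 3), c(1, 2, 4))        # sqrt(1/3)
r_squared(c(1, 2, 3), c(1, 2, 3))   # 1: perfect predictions
```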

#--- Assumes these values are already computed and stored ---
# rmse.bestcp, rmse.mincp
# rsquared.bestcp, rsquared.mincp
# rmse.bagging, rsq.bagging
# rmse.rforest, rsq.rforest

# Calculate LSE model (Stepwise AIC)
model.lse <- step(lm(totChol ~ ., data = train.data), direction = "both", trace = FALSE)
pred.lse <- predict(model.lse, newdata = test.data)

# Compute metrics for LSE model
rmse.lse <- sqrt(mean((test.data$totChol - pred.lse)^2))
ss.res.lse <- sum((test.data$totChol - pred.lse)^2)
ss.tot.lse <- sum((mean(train.data$totChol) - test.data$totChol)^2)
rsq.lse <- 1 - (ss.res.lse / ss.tot.lse)

#--- Create summary table ---
model_results <- data.frame(
  Model = c("Pruned Tree (min cp)", 
            "Pruned Tree (1-SE cp)", 
            "Bagging", 
            "Random Forest", 
            "Stepwise AIC (LSE)"),
  RMSE = c(rmse.mincp, 
          rmse.bestcp, 
          rmse.bagging, 
          rmse.rforest, 
          rmse.lse),
  R_squared = c(rsquared.mincp, 
                rsquared.bestcp, 
                rsq.bagging, 
                rsq.rforest, 
                rsq.lse)
)

# View comparison
print(model_results)
                  Model     RMSE R_squared
1  Pruned Tree (min cp) 37.80819 0.2045056
2 Pruned Tree (1-SE cp) 38.38551 0.1799623
3               Bagging 43.20233 0.1852816
4         Random Forest 42.67037 0.2052216
5    Stepwise AIC (LSE) 43.37807 0.1786397

Table 8. RMSE and R-squared values for the four tree models, with Stepwise AIC (LSE) as the baseline for comparison.

Random Forest achieved the highest R-squared (0.2052), narrowly ahead of the pruned tree with minimum cp (0.2045), indicating a slightly better ability to capture the relationships within the data. The pruned trees, however, produced the lowest RMSE values. Note that the pruned trees were evaluated on the earlier 70/30 split, while the Bagging, Random Forest, and stepwise models used the later 80/20 split, so the RMSE values are not strictly comparable across all rows of the table.

10 CART CLASSIFICATION

CART (Classification and Regression Trees) classification constructs a decision tree that splits the data into increasingly homogeneous groups based on a categorical outcome, such as the presence or absence of 10-year coronary heart disease (TenYearCHD) in the Framingham Heart Study dataset. This approach identifies the most important predictors and optimal split points to best classify individuals according to their risk of developing CHD within ten years. The key components for this analysis are listed below.

  1. Splitting Criterion: Uses measures like Gini impurity or entropy to choose the best variable and split point.
  2. Stopping Rules: Parameters such as minsplit, minbucket, and maxdepth control tree growth and prevent overfitting.
  3. Pruning: Reduces tree complexity using the complexity parameter (cp) and cross-validation to select an optimal subtree.
  4. Prediction and Evaluation: The final tree is used to predict outcomes on test data, typically evaluated using accuracy, sensitivity, specificity, and AUC.
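For intuition about the splitting criterion, Gini impurity and the weighted impurity of a candidate split can be computed directly. This is a small illustration; the function names are not part of the analysis.

```r
# Gini impurity of a node: 1 - sum of squared class proportions
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Weighted impurity after splitting y by a logical condition `left`
split_gini <- function(y, left) {
  n <- length(y)
  (sum(left) / n) * gini(y[left]) + (sum(!left) / n) * gini(y[!left])
}

y <- c(0, 0, 1, 1)
gini(y)                                     # 0.5: maximally mixed two-class node
split_gini(y, c(TRUE, TRUE, FALSE, FALSE))  # 0: the split separates the classes perfectly
```

CART chooses, at each node, the variable and cut point whose split yields the largest drop from the parent's impurity to this weighted child impurity.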

10.1 Initial CART Classification

The CART classification model for predicting TenYearCHD is built using a subset of the Framingham Heart Study dataset. The tree is constructed with specific parameters such as limiting the depth of the tree and ensuring a minimum number of observations for splits. The resulting model represents the full, unpruned tree, which is then visualized to show the class probabilities and labels at each node, providing insights into how different variables contribute to the prediction of 10-year coronary heart disease risk.

# Set seed for reproducibility
set.seed(123)

# Split data into training (70%) and testing (30%) sets
train_index <- sample(1:nrow(fhs), size = 0.7 * nrow(fhs))
train_data <- fhs[train_index, ]
test_data <- fhs[-train_index, ]

# Build initial CART classification model (predicting TenYearCHD)
class_model <- rpart(
  TenYearCHD ~ ., 
  data = train_data,
  method = "class",  # Classification tree
  control = rpart.control(
    minsplit = 10,
    minbucket = 5,
    cp = 0.0001,
    maxdepth = 5
  )
)

# Visualize the classification tree
rpart.plot(class_model, 
           extra = 104,             # Show probabilities and class labels
           box.palette = "GnBu",    # Green to blue gradient
           branch.lty = 1, 
           shadow.col = "gray", 
           nn = TRUE,               # Show node numbers
           main = "Initial CART Classification Tree for TenYearCHD")

Figure R. Initial CART Decision Tree.

10.2 Tuning

This code evaluates the tuning of a classification tree model by examining the complexity parameter (cp) values from the model’s cost-complexity table (cptable). It identifies the cp that corresponds to the minimum cross-validation error (xerror) and the cp within the 1-standard error (1-SE) range, which helps select the simplest tree that still provides optimal performance. By tuning the cp, we aim to balance model complexity and predictive accuracy, with the 1-SE rule serving as a conservative approach to prevent overfitting. The output provides both the minimum cp and the cp within the 1-SE range, offering valuable insights into the trade-off between a more complex model and a simpler, potentially more generalizable one.

pander(class_model$cptable)
  CP         nsplit   rel error   xerror   xstd
  0.009461   0        1           1        0.04297
  0.008734   4        0.9607      1.013    0.0432
  0.004367   5        0.952       1.011    0.04316
  0.001456   14       0.9061      1.035    0.04358
  0.0001     17       0.9017      1.037    0.04361
min.row <- which.min(class_model$cptable[, "xerror"])
min.cp <- class_model$cptable[min.row, "CP"]
se <- class_model$cptable[min.row, "xstd"]
xerror.cutoff <- class_model$cptable[min.row, "xerror"] + se

# Now find the simplest (largest cp) tree within 1-SE range
cp.1SE <- max(class_model$cptable[class_model$cptable[, "xerror"] <= xerror.cutoff, "CP"])

cat("min.cp:", min.cp, "\n")
min.cp: 0.009461426 
cat("cp.1SE:", cp.1SE, "\n")
cp.1SE: 0.009461426 
# Check the number of terminal nodes (leaves) in both trees
#cat("Terminal nodes for pruned.tree.1SE:", length(pruned.tree.1SE$frame$var[pruned.tree.1SE$frame$var == "<leaf>"]), "\n")
#cat("Terminal nodes for pruned.tree.min:", length(pruned.tree.min$frame$var[pruned.tree.min$frame$var == "<leaf>"]), "\n")

# Compare splits between the two trees (structure)
#print(pruned.tree.1SE)
#print(pruned.tree.min)

# Print the structure of the pruned trees to inspect differences
#summary(pruned.tree.1SE)
#summary(pruned.tree.min)

Table 9. Complexity Parameter Table.

The minimum cross-validated error occurred at a complexity parameter (cp) value of 0.0095 (cp = 0.009461). Using the 1-SE rule, the simplest tree within one standard error of the minimum error corresponds to the same cp value. This suggests that no further simplification is necessary, and according to the table the tree with cp = 0.0095 is both the most accurate and the most parsimonious choice.

The following plot displays the cross-validation results for the classification tree model, illustrating how the complexity parameter (cp) influences the model’s cross-validation error.

# Plot the cross-validation results
plotcp(class_model)

Figure S. CART Complexity Parameter (cp) Plot.

The cross-validation plot suggests that the error levels off around a tree size of 5, corresponding to a plotted cp value of roughly 0.0062 (plotcp displays geometric means of adjacent cp values), indicating that this tree size offers a reasonable balance between model complexity and predictive performance.

10.3 CART Classification (1-SE Rule)

The pruned CART classification tree using the 1-SE rule is a method for simplifying the decision tree model by reducing its complexity while maintaining predictive performance. This approach involves selecting the smallest tree within one standard error of the minimum cross-validation error, ensuring a balance between model simplicity and accuracy. Using the plot, a cp of 0.0062 is used for 1-SE Rule.

# 1. Find the optimal cp value that minimizes cross-validated error
min.cp <- class_model$cptable[which.min(class_model$cptable[, "xerror"]), "CP"]

# 2. Manually set the 1-SE cp value read from the plotcp plot
cp.1SE <- 0.0062

# 3. Prune the tree using the 1-SE cp value
pruned.tree.1SE <- prune(class_model, cp = cp.1SE)

# 4. Visualize the pruned tree (1-SE rule version)
rpart.plot(pruned.tree.1SE, 
           extra = 104,              # Show predicted class and probability
           box.palette = "GnBu",     # Green to blue color scheme
           branch.lty = 1, 
           shadow.col = "gray", 
           nn = TRUE, 
           main = "Pruned CART Classification Tree (1-SE Rule)")

# 5. Generate predictions for pruned tree (1-SE rule)
pred.prob.1SE <- predict(pruned.tree.1SE, test_data, type = "prob")[, 2]

# 6. Create ROC curve for pruned tree (1-SE rule)
roc.tree.1SE <- pROC::roc(test_data$TenYearCHD, pred.prob.1SE)

# 7. Access the AUC directly from the ROC object
auc.tree.1SE <- roc.tree.1SE$auc


# 8. Optionally, print the AUC
#print(paste("AUC for pruned tree (1-SE):", round(auc.tree.1SE, 4)))

Figure T. Pruned CART Classification tree using 1-SE Rule.

10.4 CART Classification Tree (Min CP)

The pruned CART classification tree is visualized using the minimum complexity parameter (cp) value, which helps in reducing model complexity while retaining predictive accuracy. This visualization highlights the class probabilities and the number of observations at each node, providing a clearer understanding of the model’s decision-making process.

# Prune the tree using the cp that gives minimum cross-validated error
pruned.tree.min <- prune(class_model, cp = min.cp)

# Visualize the pruned tree with the chosen minimum cp
rpart.plot(pruned.tree.min, 
           extra = 104,  # Displays more detailed information on nodes (class probabilities, number of observations)
           box.palette = "GnBu",  # Green to Blue color scheme for node boxes
           branch.lty = 1,  # Solid lines for branches
           shadow.col = "gray",  # Shadow color for node labels
           nn = TRUE,  # Add node numbers to the tree
           main = "Pruned CART Classification Tree (Min CP)"  # Title for the plot
)

Figure U. Pruned CART Classification tree using min CP.

The pruned CART classification tree using the minimum CP resulted in a single node, suggesting that this approach did not produce a meaningful or informative model.

10.5 BAGGING CART Classification

The BAGGING (Bootstrap Aggregating) method is applied to the CART (Classification and Regression Trees) model to predict TenYearCHD, a binary outcome indicating the likelihood of cardiovascular disease within ten years. BAGGING combines multiple decision trees built on bootstrapped subsets of the data, reducing variance and improving model robustness. By aggregating the predictions of individual trees, BAGGING helps mitigate overfitting, making it particularly useful for complex datasets like the Framingham Heart Study. The analysis explores the model’s performance by tuning hyperparameters and evaluating the results on the test set using classification accuracy and AUC, ultimately identifying the optimal parameters for accurate predictions.

First, the BAGGING model is trained on the chosen training data, and the predictions from each decision tree are aggregated to make the final classification, ensuring that each tree’s prediction contributes to the overall output.

# Set seed for reproducibility
set.seed(123)

# Ensure the target is a factor
fhs$TenYearCHD <- as.factor(fhs$TenYearCHD)

# Split data into training and testing sets
trainIndex <- createDataPartition(fhs$TenYearCHD, p = 0.7, list = FALSE)
trainData <- fhs[trainIndex, ]
testData <- fhs[-trainIndex, ]

# Create hyperparameter grid
hyper.grid <- expand.grid(
  nbagg = c(25, 50, 100),
  minsplit = c(5, 10, 20),
  maxdepth = c(5, 10, 20),
  cp = c(0.01, 0.001)
)

# Initialize tracking variables
results <- data.frame()
best.accuracy <- 0
best.params <- list()

# Loop through hyperparameter combinations
for(i in 1:nrow(hyper.grid)) {
  params <- hyper.grid[i, ]
  
  ctrl <- rpart.control(
    minsplit = params$minsplit,
    maxdepth = params$maxdepth,
    cp = params$cp
  )
  
  # Train Bagging model
  bag.model <- bagging(
    TenYearCHD ~ .,
    data = trainData,
    nbagg = params$nbagg,
    coob = TRUE,
    control = ctrl
  )
  
  # Predict on test set
  preds <- predict(bag.model, newdata = testData)
  
  # Accuracy
  cm <- confusionMatrix(preds, testData$TenYearCHD)
  accuracy <- cm$overall["Accuracy"]
  
  # Store results
  results <- rbind(results, data.frame(
    nbagg = params$nbagg,
    minsplit = params$minsplit,
    maxdepth = params$maxdepth,
    cp = params$cp,
    Accuracy = accuracy
  ))
  
  # Track best result
  if(accuracy > best.accuracy) {
    best.accuracy <- accuracy
    best.params <- params
  }
}

# Display best parameters
cat("Best BAGGING hyperparameters:\n")
Best BAGGING hyperparameters:
pander(best.params)
     nbagg   minsplit   maxdepth   cp
32   50      10         5          0.001
# Optional: View full results
# pander(results)

Table 10. Best hyperparameter combination for the final BAGGING classification model.

The output shows the optimal hyperparameters for the BAGGING model: 50 trees (nbagg), a minimum of 10 observations per split (minsplit), a maximum tree depth of 5 (maxdepth), and a complexity parameter (cp) of 0.001.

Next the final BAGGING model is trained using the best hyperparameters identified during tuning. The model is then used to predict the test data, and the results are evaluated using a labeled confusion matrix to assess the classification accuracy for predicting TenYearCHD.

# Set rpart control using best parameters from tuning
best.control <- rpart.control(
  minsplit = best.params$minsplit,
  maxdepth = best.params$maxdepth,
  cp = best.params$cp
)

# Train the final Bagging model on training data
final.bag.model <- bagging(
  TenYearCHD ~ .,
  data = trainData,
  nbagg = best.params$nbagg,
  coob = TRUE,
  control = best.control
)

# Generate predictions (probabilities) for the Bagging model
pred.bag.prob <- predict(final.bag.model, newdata = testData, type = "prob")[, 2]  # [ ,2] selects the probability for the positive class (1)

# Compute ROC for Bagging model
roc.bagging <- roc(testData$TenYearCHD, pred.bag.prob)

# Compute AUC for Bagging model
auc.bagging <- roc.bagging$auc

# Plot ROC curve for the Bagging model
#plot(roc.bagging, main = "ROC Curve - Bagging", col = "blue", lwd = 2)

# Now proceed with comparing other models and plotting as needed

10.6 Random Forest Classification

Random Forest classification is an ensemble learning method that builds multiple decision trees and aggregates their predictions to improve classification accuracy. Each tree is trained on a random subset of the data with random feature selection at each split, which helps reduce overfitting and increases the model’s robustness. The final prediction is determined by the majority vote of the individual trees, making Random Forest a powerful tool for classification tasks, especially when dealing with complex and high-dimensional data.
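The majority-vote aggregation described above can be sketched directly (the function and object names here are illustrative; in practice randomForest performs this internally):

```r
# Majority vote across an ensemble's class predictions
majority_vote <- function(pred_matrix) {
  # rows = observations, columns = trees, entries = predicted class labels
  apply(pred_matrix, 1, function(v) names(which.max(table(v))))
}

votes <- matrix(c("1", "1", "0",
                  "0", "0", "1"), nrow = 2, byrow = TRUE)
majority_vote(votes)  # "1" for the first observation, "0" for the second
```

Averaging the per-tree class proportions instead of counting votes yields the class probabilities used later for the ROC analysis.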

Hyperparameter tuning involves selecting the optimal settings for a model’s parameters, such as the number of trees, the maximum depth of each tree, and the minimum number of samples required for a split, to enhance its performance and accuracy.

set.seed(123)

# Two-way split: training (70%) / testing (30%), stratified on the outcome
train.idx <- createDataPartition(fhs$TenYearCHD, p = 0.7, list = FALSE)
train.data <- fhs[train.idx, ]
test.data <- fhs[-train.idx, ]

# Drop BPMeds_original (if present) from both sets
train.data <- train.data[, !colnames(train.data) %in% "BPMeds_original"]
test.data <- test.data[, !colnames(test.data) %in% "BPMeds_original"]

# Ensure same columns and column order
test.data <- test.data[, names(train.data)]

# Ensure outcome is a factor with levels "0", "1"
train.data$TenYearCHD <- as.factor(train.data$TenYearCHD)
test.data$TenYearCHD <- as.factor(test.data$TenYearCHD)

# Align all factor levels between training and test sets
for (col in names(train.data)) {
  if (is.factor(train.data[[col]])) {
    test.data[[col]] <- factor(test.data[[col]], levels = levels(train.data[[col]]))
  }
}

# Cross-validation setup
k <- 5
train.size <- nrow(train.data)
fold.size <- floor(train.size / k)

# Hyperparameter grid
tune.grid <- expand.grid(
  mtry = c(2, 3, 4, 5),
  ntree = c(100, 300, 500),
  nodesize = c(1, 3, 5, 10),
  maxnodes = c(5, 10, 20, NA)
)

# Keep valid combinations; note that subset() also drops the maxnodes = NA
# rows (the condition evaluates to NA there), so only bounded trees are tuned
p <- ncol(train.data) - 1  # Number of predictors
tune.grid <- subset(tune.grid, mtry >= 1 & mtry <= p & nodesize < maxnodes)

# Initialize results storage
results <- data.frame()
best.auc <- 0
best.hyp.params <- NULL

# Grid search with cross-validation
for (i in 1:nrow(tune.grid)) {
  current.tune.params <- tune.grid[i, ]
  cv.auc <- numeric(k)  # Preallocate AUC vector for folds
  
  for (j in 1:k) {
    # Define validation fold indices (contiguous blocks; any remainder rows
    # beyond k * fold.size never serve as validation data)
    cv.id <- ((j - 1) * fold.size + 1):(j * fold.size)
    cv.train <- train.data[-cv.id, ]
    cv.valid <- train.data[cv.id, ]
    
    # Train the model
    rf.cv <- randomForest(
      TenYearCHD ~ .,
      data = cv.train,
      mtry = current.tune.params$mtry,
      ntree = current.tune.params$ntree,
      nodesize = current.tune.params$nodesize,
      maxnodes = current.tune.params$maxnodes
    )
    
    # Predict probabilities for positive class '1'
    prob.cv <- predict(rf.cv, newdata = cv.valid, type = "prob")[, "1"]
    
    # Compute AUC using pROC
    roc_obj <- pROC::roc(
      response = cv.valid$TenYearCHD,
      predictor = prob.cv,
      direction = "<"
    )
    cv.auc[j] <- pROC::auc(roc_obj)
  }
  
  # Compute average AUC over folds
  avg.auc <- mean(cv.auc)
  
  # Store results
  results <- rbind(results, data.frame(
    mtry = current.tune.params$mtry,
    ntree = current.tune.params$ntree,
    nodesize = current.tune.params$nodesize,
    maxnodes = current.tune.params$maxnodes,
    auc = avg.auc
  ))
  
  # Update best if current is better
  if (avg.auc > best.auc) {
    best.auc <- avg.auc
    best.hyp.params <- current.tune.params
  }
}

# Display best parameters
pander(data.frame(cbind(best.hyp.params, best.auc)))
     mtry ntree nodesize maxnodes best.auc
143     4   500       10       20   0.6965

Table 11. Random Forest Hyperparameter Tuning Results Table

The output indicates the optimal hyperparameters for the Random Forest model, with mtry set to 4, ntree set to 500, nodesize set to 10, and maxnodes set to 20. These settings resulted in a best AUC of 0.6965, which reflects the model’s performance in distinguishing between the classes.
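Beyond the single best row, it can be useful to inspect the top of the `results` data frame produced by the grid search. A minimal sketch with hypothetical AUC values (the column names match the tuning loop above):

```r
# Hypothetical slice of the grid-search results; in the analysis above,
# `results` is filled in row by row during cross-validation.
results_demo <- data.frame(
  mtry     = c(2, 4, 3),
  ntree    = c(100, 500, 300),
  nodesize = c(5, 10, 3),
  maxnodes = c(10, 20, 5),
  auc      = c(0.671, 0.6965, 0.683)
)

# Sort so the strongest hyperparameter combinations appear first
results_sorted <- results_demo[order(-results_demo$auc), ]
head(results_sorted, 3)
```

Examining several near-optimal rows helps check whether the chosen settings sit on a stable plateau or are an isolated spike.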

To finalize the Random Forest model, the best hyperparameters identified from tuning are used to train the model on the full training dataset. Predictions for class probabilities of the positive class (“1”) are then made for the test data. In the subsequent steps, the model’s performance will be assessed using the ROC curve and AUC, which will be presented during the comparison section.

# Train the final Random Forest model using best parameters
final.rf.cls <- randomForest(
  TenYearCHD ~ .,
  data = train.data,
  ntree = best.hyp.params$ntree,
  mtry = best.hyp.params$mtry,
  nodesize = best.hyp.params$nodesize,
  maxnodes = best.hyp.params$maxnodes,
  importance = TRUE
)

# Predict class probabilities for the positive class "1"
pred.rf.prob <- predict(final.rf.cls, test.data, type = "prob")[, "1"]

# Ensure response is a factor with correct levels
test.response <- factor(test.data$TenYearCHD, levels = c(0, 1))

# Compute ROC and AUC
rf.roc <- pROC::roc(response = test.response, predictor = pred.rf.prob, levels = c("0", "1"), direction = "<")
rf.auc <- pROC::auc(rf.roc)

# Display AUC
#print(rf.auc)

The importance of each predictor variable in the final Random Forest model is assessed using two metrics: Mean Decrease in Accuracy and Mean Decrease in Gini. These measures are extracted and visualized to highlight the most influential variables, providing insights into which features contribute the most to the model’s predictive power. The results are displayed in a bar plot that separates the importance scores for each metric.

# Extract importance measures
importance_vals <- importance(final.rf.cls)
importance_df <- data.frame(
  Variable = rownames(importance_vals),
  MeanDecreaseAccuracy = importance_vals[, "MeanDecreaseAccuracy"],
  MeanDecreaseGini = importance_vals[, "MeanDecreaseGini"]
)


# Pivot for plotting
importance_long <- pivot_longer(
  importance_df,
  cols = c(MeanDecreaseAccuracy, MeanDecreaseGini),
  names_to = "Metric",
  values_to = "Importance"
)

# Plot
ggplot(importance_long, aes(x = reorder(Variable, Importance), y = Importance, fill = Metric)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
  facet_wrap(~ Metric, scales = "free_x") +
  labs(title = "Variable Importance (Random Forest)",
       x = "Variable",
       y = "Importance Score") +
  theme_minimal() +
  theme(legend.position = "none")

Figure V. Random Forest Feature Importance Visualization

10.7 Comparison

To evaluate and compare the predictive performance of the pruned CART models, logistic regression, Bagging, and Random Forest, we generated ROC (Receiver Operating Characteristic) curves using the test dataset. These curves illustrate the trade-off between sensitivity and specificity across all possible classification thresholds, providing a visual comparison of model performance across different methods.

# Fit the Logistic Regression Model
logit_model <- glm(TenYearCHD ~ ., data = train.data, family = binomial)

# Predict probabilities for the test set
pred.logit.prob <- predict(logit_model, newdata = test.data, type = "response")

# Ensure response is a factor with correct levels
test.response <- factor(test.data$TenYearCHD, levels = c(0, 1))

# Compute ROC and AUC for Logistic Regression
logit.roc <- roc(response = test.response, predictor = pred.logit.prob, levels = c("0", "1"), direction = "<")
logit.auc <- pROC::auc(logit.roc)


# Recompute prediction, ensure it's aligned and un-named
pred.prob.1SE <- predict(pruned.tree.1SE, newdata = test.data, type = "prob")[, 2]
pred.prob.1SE <- as.numeric(pred.prob.1SE)  # removes any row names

# ROC for each model (ensure the model probabilities are defined first)
roc.tree.1SE <- roc(test.data$TenYearCHD, pred.prob.1SE)

pred.prob.min <- predict(pruned.tree.min, newdata = test.data, type = "prob")[, 2]
pred.prob.min <- as.numeric(pred.prob.min)

roc.tree.min <- roc(test.data$TenYearCHD, pred.prob.min)
roc.bagging <- roc(test.data$TenYearCHD, pred.bag.prob)
roc.rf <- roc(test.data$TenYearCHD, pred.rf.prob)
roc.logit <- roc(test.data$TenYearCHD, pred.logit.prob)

# AUC values
auc.tree.1SE <- pROC::auc(roc.tree.1SE)
auc.tree.min <- pROC::auc(roc.tree.min)
auc.bagging <- pROC::auc(roc.bagging)
auc.rf <- pROC::auc(roc.rf)
auc.logit <- pROC::auc(roc.logit)

# Plot ROC curves for comparison
par(mar = c(5, 5, 4, 1))
plot(1 - roc.logit$specificities, roc.logit$sensitivities, 
     type = "l", col = "darkorange", lwd = 2, 
     xlab = "1 - Specificity", ylab = "Sensitivity", 
     main = "ROC Curve Comparison")

lines(1 - roc.tree.1SE$specificities, roc.tree.1SE$sensitivities, col = "gold", lwd = 2)
lines(1 - roc.tree.min$specificities, roc.tree.min$sensitivities, col = "purple", lwd = 2)
lines(1 - roc.bagging$specificities, roc.bagging$sensitivities, col = "blue", lwd = 2)
lines(1 - roc.rf$specificities, roc.rf$sensitivities, col = "forestgreen", lwd = 2)

# Add a legend (label order matches the line colors drawn above:
# blue = Bagging, forestgreen = Random Forest)
legend("bottomright",
       legend = c("Logistic", "CART 1SE", "CART Min", "Bagging", "Random Forest"),
       col = c("darkorange", "gold", "purple", "blue", "forestgreen"),
       lty = 1,
       lwd = 2,
       bty = "n",
       cex = 0.6)

### === Annotate with AUCs === ###
# Annotating the plot with the AUC values
text(0.75, 0.48, paste("Logistic AUC: ", round(auc.logit, 4)), cex = 0.7)
text(0.75, 0.42, paste("CART 1SE AUC: ", round(auc.tree.1SE, 4)), cex = 0.7)
text(0.75, 0.36, paste("CART Min AUC: ", round(auc.tree.min, 4)), cex = 0.7)
text(0.75, 0.30, paste("Random Forest AUC: ", round(auc.rf, 4)), cex = 0.7)
text(0.75, 0.24, paste("Bagging AUC: ", round(auc.bagging, 4)), cex = 0.7)

Figure W. ROC Curves for Classification Models.

Receiver Operating Characteristic (ROC) Curves for Multiple Predictive Models: The plot compares the ROC curves for Logistic Regression, CART with 1-SE pruning, CART with Minimum Cost-Complexity Pruning, Bagging, and Random Forest. AUC values for each model are annotated, highlighting their classification performance. Among the models evaluated, Logistic Regression achieved the highest AUC, demonstrating superior accuracy in distinguishing between individuals who will and will not develop coronary heart disease over a ten-year period. This suggests that, in this scenario, Logistic Regression provides a more reliable risk prediction tool compared to the other models.

10.8 Optimal Cutoff

The optimal cutoff represents the threshold probability at which the classification model balances sensitivity and specificity, maximizing the overall performance of the model. By using this cutoff, we ensure that predictions are made with the most effective trade-off between correctly identifying positive cases and minimizing false positives, which is crucial for making reliable decisions in practical applications such as healthcare or finance.
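Youden's J statistic selects the threshold that maximizes sensitivity + specificity − 1. The computation can be sketched in base R on toy probabilities (illustrative values, not the model's actual predictions):

```r
# Toy predicted probabilities and true labels
probs  <- c(0.04, 0.11, 0.22, 0.33, 0.61, 0.82)
labels <- c(0, 0, 0, 1, 1, 1)

# Evaluate J = sensitivity + specificity - 1 over a grid of thresholds
thresholds <- seq(0.05, 0.95, by = 0.05)
j_stats <- sapply(thresholds, function(t) {
  pred <- as.integer(probs >= t)
  sens <- sum(pred == 1 & labels == 1) / sum(labels == 1)
  spec <- sum(pred == 0 & labels == 0) / sum(labels == 0)
  sens + spec - 1
})

# Threshold with the largest J (ties resolved by the first maximum)
thresholds[which.max(j_stats)]  # 0.25
```

The `coords()` call below does the equivalent search over the thresholds implied by the fitted ROC curve.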

# Find the optimal cutoff using Youden's J statistic
coords.optimal <- coords(roc.logit, x = "best", best.method = "youden", ret = c("threshold", "sensitivity", "specificity"))

# Print the optimal cutoff and associated sensitivity/specificity
print(coords.optimal)
  threshold sensitivity specificity
1 0.1563719   0.6321244    0.703154

Table 12. The optimal cutoff.

The threshold value of 0.1563719 is the optimal cutoff, chosen to balance sensitivity and specificity. Using this threshold, the model classifies a person as at risk for coronary heart disease if their predicted probability exceeds 15.6%. A sensitivity of 63.2% indicates that the model is fairly good at identifying individuals who will develop coronary heart disease, although it misses about 36.8% of the true positives (false negatives). On the other hand, the specificity of 70.3% indicates that the model does a better job of correctly identifying individuals who will not develop the disease, though there is still room for improvement (about 29.7% false positives).
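In practice, the cutoff is applied to predicted probabilities to produce class labels, which can then be cross-tabulated against the true outcomes. A toy sketch (illustrative vectors, not the actual test-set predictions):

```r
cutoff <- 0.156  # the tuned threshold, rounded

# Hypothetical predicted probabilities and true outcomes
pred_prob <- c(0.05, 0.20, 0.18, 0.40, 0.09, 0.70)
actual    <- factor(c(0, 0, 1, 1, 0, 1), levels = c(0, 1))

# Classify as "at risk" when the predicted probability meets the cutoff
pred_class <- factor(as.integer(pred_prob >= cutoff), levels = c(0, 1))

table(Predicted = pred_class, Actual = actual)
```

The same two lines applied to `pred.logit.prob` and `test.data$TenYearCHD` yield the confusion matrix at the tuned threshold.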

11 CONCLUSION

The regression models evaluated, including Pruned Trees (both 1-SE and Minimum Cost-Complexity), Bagging, and Random Forest, provided valuable insights into predicting total cholesterol levels (totChol). Among these, the Random Forest model demonstrated the best performance with the lowest RMSE and highest R-squared value. The Pruned Tree (Minimum Cost-Complexity) model, while still performing well, showed a higher RMSE, suggesting that its predictions were less accurate. Given the nature of the task—predicting a continuous variable such as cholesterol levels—it is recommended to utilize Random Forest as the primary model for future predictions due to its higher accuracy and robustness. However, for interpretability and simplicity, Pruned Trees may also be considered when a more transparent model is needed.

For classification, the comparison of models such as Logistic Regression, CART (1-SE and Min), Bagging, and Random Forest for predicting coronary heart disease (TenYearCHD) revealed that Logistic Regression outperformed all other models, achieving the highest AUC. This suggests that Logistic Regression is particularly effective at distinguishing between individuals who will and will not develop coronary heart disease in the next ten years. While Random Forest and Bagging showed strong performance, Logistic Regression provided a more reliable and simpler model for this task. Therefore, it is recommended to prioritize Logistic Regression for future use in clinical applications where understanding and explaining the risk of coronary heart disease is critical. For improved performance in more complex datasets or non-linear relationships, Random Forest and Bagging can be considered as alternative models for further exploration.

---
title: "Modeling Coronary Heart Disease Risk and Cholesterol Levels: A Machine Learning Approach with the Framingham Heart Study"
author: "Joanne Musa"
date: "April 2025"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
    fig_width: 3
    fig_height: 3
editor_options: 
  chunk_output_type: inline
---

```{=html}

<style type="text/css">

/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. it is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 22px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 18px;
  font-weight: bold;
  font-family: system-ui;
  color: navy;
  text-align: center;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-family: system-ui;
  color: DarkBlue;
  text-align: center;
  font-weight: bold;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 22px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: center;
    font-weight: bold;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 20px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
    font-weight: bold;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

</style>
```

```{r setup, include=FALSE}

# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.

if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("tidyverse")) {
   install.packages("tidyverse")
library(tidyverse)
}

if (!require("e1071")) {
   install.packages("e1071")
library(e1071)
}
if (!require("mmeln")) {
   install.packages("mmeln")
library(mmeln)
}
if (!require("MASS")) {
   install.packages("MASS")
library(MASS)
}
if (!require("ggplot2")) {
   install.packages("ggplot2")
library(ggplot2)
}
if (!require("plotly")) {
   install.packages("plotly")
library(plotly)
}
if (!require("caret")) {
   install.packages("caret")
library(caret)
}
if (!require("pander")) {
   install.packages("pander")
library(pander)
}
if (!require("randomForest")) {
   install.packages("randomForest")
library(randomForest)
}
if (!require("rpart")) {
   install.packages("rpart")
library(rpart)
}
if (!require("rpart.plot")) {
   install.packages("rpart.plot")
library(rpart.plot)
}
if (!require("pROC")) {
   install.packages("pROC")
library(pROC)
}

if (!require("corrplot")) {
   install.packages("corrplot")
library(corrplot)
}

if (!require("glmnet")) {
   install.packages("glmnet")
library(glmnet)
}

if (!require("Metrics")) {
   install.packages("Metrics")
library(Metrics)
}

if (!require("mice")) {
   install.packages("mice")
library(mice)
}

if (!require("ipred")) {
   install.packages("ipred")
   library(ipred)
}



##
knitr::opts_chunk$set(echo = TRUE,   
                      warning = FALSE, 
                      results = TRUE, 
                      message = FALSE,
                      comment = NA
                      )  
```


# INTRODUCTION: FRAMINGHAM DATA SUBSET

The Framingham Heart Study, initiated in 1948, was designed to identify the factors and characteristics that contribute to the development of cardiovascular disease. Based in Massachusetts, the study has followed multiple generations of participants using a longitudinal design that includes regular examinations and health assessments. Participants undergo comprehensive medical evaluations, laboratory testing, and lifestyle assessments. The study captures a wide range of risk factors, including both behavioral and biological components.

This analysis focuses on a subset of data from the Framingham Heart Study. Specifically, it investigates how selected biological and behavioral risk factors predict the likelihood of developing coronary heart disease (CHD) within a ten-year period.


## Description of Data

The original Framingham Heart Study includes data from approximately 15,000 participants and over 1,000 variables. For this analysis, a subset of the data is used, containing 4,240 observations and 16 variables.

This analysis focuses specifically on biological features that directly contribute to health outcomes. Prior analysis (https://jmusa3.github.io/jmusa/project/Project_2.html) indicated that education was not a significant predictor of either Ten-Year Coronary Heart Disease (CHD) risk or total cholesterol levels, supporting its removal from the feature set. As such, the variable “Education” has been excluded from the dataset.

The original feature "male" is renamed to "sex" for ease of comprehension. After excluding “Education,” the dataset for this analysis consists of 15 variables (14 predictors plus the response):

- 7 factor variables (categorical)
- 8 numeric variables (continuous)

The names and definitions of these variables are provided below.

1. sex (factor): the sex of the participant: female (0), male (1).
2. age (numerical): Age at the time of the medical examination, in years.
3. currentSmoker (factor): Current cigarette smoking at the time of the examination.
4. cigsPerDay (numerical): Number of cigarettes smoked each day.
5. BPMeds (factor): Use of anti-hypertensive medication at the exam.
6. prevalentStroke (factor): Prevalent stroke (0 = free of disease).
7. prevalentHyp (factor): Prevalent hypertension; a subject was defined as hypertensive if treated.
8. diabetes (factor): Diabetic according to criteria of the first exam.
9. totChol (numerical): Total cholesterol (mg/dL).
10. sysBP (numerical): Systolic blood pressure (mmHg).
11. diaBP (numerical): Diastolic blood pressure (mmHg).
12. BMI (numerical): Body Mass Index, weight (kg)/height (m)^2.
13. heartRate (numerical): Heart rate (beats/minute).
14. glucose (numerical): Blood glucose level (mg/dL).
15. TenYearCHD (factor, response variable): The 10-year risk of coronary heart disease (CHD).

The data set has missing values for six variables: cigsPerDay, BPMeds, totChol, BMI, heartRate, and glucose.
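A quick way to confirm which variables contain missing values is to count NAs per column; on the full data frame this is `colSums(is.na(fhs))`. A self-contained sketch of the idea on a toy frame:

```r
# Toy frame mimicking two of the affected variables
toy <- data.frame(totChol = c(200, NA, 250),
                  glucose = c(NA, 85, 90))

# Count missing values in each column
colSums(is.na(toy))  # totChol: 1, glucose: 1
```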

```{r}

fhs <- read.csv("https://raw.githubusercontent.com/jmusa3/jmusa/refs/heads/main/FraminghamHeartStudy.csv")

fhs <- fhs %>% dplyr::select(-education)

# change 'male' to 'sex' where 1 = male and 0 = female
fhs <- fhs %>%
  rename(sex = male)
fhs$sex <- factor(fhs$sex)
fhs$BPMeds <- factor(fhs$BPMeds)
fhs$currentSmoker <- factor(fhs$currentSmoker)
fhs$prevalentStroke <- factor(fhs$prevalentStroke)
fhs$prevalentHyp <- factor(fhs$prevalentHyp)
fhs$diabetes <- factor(fhs$diabetes)

summary(fhs)
#head(fhs)
```

Table 1. Summary statistics for the 15 features in the Framingham Heart Study subset. Six of the variables have missing values. 



# EDA

## Single Features

All of the features in this data set are typed as numerical in the raw file, so the indicator variables are converted to factors. For visualization, "male" is renamed to "sex", and the indicator values of the categorical variables are replaced with descriptive labels to enhance the readability of the graphs.


```{r}

# Set up plotting area
par(mfrow = c(2, 2))

# Histogram for age
hist(fhs$age,
     main = "Age",
     xlab = "Age",
     col = "steelblue",
     border = "white")

# Histogram for cigsPerDay
hist(fhs$cigsPerDay,
     main = "Cigarettes Per Day",
     xlab = "Cigarettes Per Day",
     col = "steelblue",
     border = "white")

# Histogram for totChol
hist(fhs$totChol,
     main = "Total Cholesterol",
     xlab = "Total Cholesterol",
     col = "steelblue",
     border = "white")

# Histogram for sysBP
hist(fhs$sysBP,
     main = "Systolic Blood Pressure",
     xlab = "Systolic BP",
     col = "steelblue",
     border = "white")

# Reset plotting layout to default
par(mfrow = c(1, 1))



```

Figure A. Histograms of age, CigsPerDay, totChol (response), and sysBP. 




```{r}
# Set up 2x2 plotting area
par(mfrow = c(2, 2))

# Histogram for diastolic blood pressure
hist(fhs$diaBP,
     main = "Diastolic Blood Pressure",
     xlab = "Diastolic BP",
     col = "steelblue",
     border = "white")

# Histogram for BMI
hist(fhs$BMI,
     main = "Body Mass Index (BMI)",
     xlab = "BMI",
     col = "steelblue",
     border = "white")

# Histogram for heart rate
hist(fhs$heartRate,
     main = "Heart Rate",
     xlab = "Beats Per Minute",
     col = "steelblue",
     border = "white")

# Histogram for glucose
hist(fhs$glucose,
     main = "Glucose Level",
     xlab = "Glucose",
     col = "steelblue",
     border = "white")

# Reset to default plotting layout
par(mfrow = c(1, 1))

```

Figure B. Histograms for diaBP, BMI, heartRate, and glucose. 



```{r}

# Set up 2x2 plotting layout
par(mfrow = c(2, 2))


# Bar plot for sex
barplot(table(fhs$sex),
        main = "Sex Distribution",
        col = "steelblue",
        names.arg = c("Female", "Male"),
        ylab = "Count")



# Bar plot for current smoker
barplot(table(fhs$currentSmoker),
        main = "Current Smoker Status",
        col = "steelblue",
        names.arg = c("No", "Yes"),
        ylab = "Count")

# Bar plot for BP meds
barplot(table(fhs$BPMeds),
        main = "Blood Pressure Medication",
        col = "steelblue",
        names.arg = c("No", "Yes"),
        ylab = "Count")



```

Figure C. Bar charts for sex, currentSmoker and BPMeds. 


```{r}

# Set 2x2 layout for bar charts
par(mfrow = c(2, 2))

# Bar chart for prevalentStroke
barplot(table(fhs$prevalentStroke),
        main = "Prevalent Stroke",
        xlab = "0 = No, 1 = Yes",
        col = "steelblue")

# Bar chart for prevalentHyp
barplot(table(fhs$prevalentHyp),
        main = "Prevalent Hypertension",
        xlab = "0 = No, 1 = Yes",
        col = "steelblue")

# Bar chart for diabetes
barplot(table(fhs$diabetes),
        main = "Diabetes",
        xlab = "0 = No, 1 = Yes",
        col = "steelblue")

# Bar chart for TenYearCHD
barplot(table(fhs$TenYearCHD),
        main = "Ten-Year CHD",
        xlab = "0 = No, 1 = Yes",
        col = "steelblue")

```

Figure D. Bar charts for prevalentStroke, prevalentHyp, diabetes, and TenYearCHD (response). 


# MISSING VALUES


Imputation is a critical step in data preprocessing, especially when working with datasets containing missing values. It involves estimating and replacing missing data with plausible values based on the available information.

For the missing cigsPerDay values, any instances that are not current smokers (currentSmoker = 0) are imputed with zero. All other instances with missing cigsPerDay are imputed with the mean number of cigarettes per day among current smokers.


```{r}


# Step 1: Save the original cigsPerDay before imputation
cigs_before <- fhs$cigsPerDay

# Step 2: Impute using currentSmoker
mean_cigs_smokers <- mean(fhs$cigsPerDay[fhs$currentSmoker == 1 & !is.na(fhs$cigsPerDay)], na.rm = TRUE)
fhs$cigsPerDay[is.na(fhs$cigsPerDay) & fhs$currentSmoker == 0] <- 0
fhs$cigsPerDay[is.na(fhs$cigsPerDay) & fhs$currentSmoker == 1] <- mean_cigs_smokers

# Step 3: Create a combined data frame for plotting
cigs_df <- rbind(
  data.frame(value = cigs_before, group = "Before"),
  data.frame(value = fhs$cigsPerDay, group = "After")
)

# Step 4: Plot the overlay density curve
ggplot(cigs_df, aes(x = value, fill = group, color = group)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Plot of cigsPerDay Before and After Imputation",
       x = "Cigarettes Per Day", y = "Density") +
  scale_fill_manual(values = c("Before" = "lightblue", "After" = "salmon")) +
  scale_color_manual(values = c("Before" = "lightblue", "After" = "salmon")) +
  theme_minimal()



```

Figure E. Post imputation, cigsPerDay has zero missing values. The before and after imputation distributions are very similar.

The distributions of cigsPerDay before and after imputation are very similar, indicating that the imputation process effectively preserved the original data characteristics.


Multiple imputation by chained equations (MICE) is used to impute the remaining missing values for totChol, BMI, heartRate, and glucose (cigsPerDay, already completed above, is retained in the imputation frame as a predictor; the categorical variable BPMeds is handled separately below). MICE is an iterative method that creates multiple imputed datasets, improving the reliability of the imputation process by incorporating uncertainty about the missing values. The imputed values provide a more complete dataset for subsequent analysis, such as modeling and prediction.

```{r}

# Create a data frame for the features before imputation (original data)
fhs_missing_before <- fhs[, c("cigsPerDay", "totChol", "BMI", "heartRate", "glucose")]

# Perform MICE imputation while fully suppressing console output
invisible(capture.output({
  mice_imputation <- mice(fhs_missing_before, method = 'pmm', m = 5, maxit = 50, seed = 500)
}))

# Complete the imputed data
fhs_imputed <- complete(mice_imputation, action = 1)

# Prepare data for plotting (before and after)
fhs_before <- data.frame(value = as.vector(as.matrix(fhs_missing_before)),
                         variable = rep(colnames(fhs_missing_before), each = nrow(fhs_missing_before)),
                         dataset = rep("Before", nrow(fhs_missing_before) * ncol(fhs_missing_before)))

fhs_after <- data.frame(value = as.vector(as.matrix(fhs_imputed)),
                        variable = rep(colnames(fhs_imputed), each = nrow(fhs_imputed)),
                        dataset = rep("After", nrow(fhs_imputed) * ncol(fhs_imputed)))

# Combine the data for before and after into one data frame
fhs_combined <- rbind(fhs_before, fhs_after)

# Create the density plot with ggplot
ggplot(fhs_combined, aes(x = value, fill = dataset, color = dataset)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~variable, scales = "free") +
  labs(title = "Density Plots: Before and After Imputation", 
       x = "Value", y = "Density") +
  theme_minimal() +
  scale_fill_manual(values = c("Before" = "lightblue", "After" = "pink")) +
  scale_color_manual(values = c("Before" = "lightblue", "After" = "pink"))


```

Figure F. Before and after imputation density curves.


The distributions of the variables before and after imputation are very similar, indicating that the imputation process effectively preserved the original data characteristics. The use of multiple imputation by chained equations (MICE) has ensured that the imputed values closely follow the patterns and spread of the observed data, maintaining the integrity of the variables for further analysis. The imputed values are then added to the final working dataset for analysis.



```{r}
# Overwrite original columns in fhs with imputed values
fhs$cigsPerDay <- fhs_imputed$cigsPerDay
fhs$totChol <- fhs_imputed$totChol
fhs$BMI <- fhs_imputed$BMI
fhs$heartRate <- fhs_imputed$heartRate
fhs$glucose <- fhs_imputed$glucose



```



Prior to modeling, the variable BPMeds (whether the participant was on blood pressure medication) contains missing values. To address this, a logistic regression model is fitted using relevant predictors (age, sysBP, diaBP, cigsPerDay, and prevalentHyp) on the cases with observed BPMeds values. Predicted probabilities from this model are then used to impute the missing values, with a threshold of 0.5 classifying individuals as either taking or not taking blood pressure medication.

```{r}

# Save original BPMeds with NAs before imputation
fhs$BPMeds_original <- fhs$BPMeds

# Keep rows where BPMeds is not missing
bpm_complete <- fhs[!is.na(fhs$BPMeds_original), ]

# Fit logistic regression model
bpm_model <- glm(BPMeds_original ~ age + sysBP + diaBP + cigsPerDay + prevalentHyp,
                 data = bpm_complete, family = binomial)

# Identify rows with missing BPMeds
bpm_missing <- fhs[is.na(fhs$BPMeds_original), ]

# Identify rows among bpm_missing that have complete predictors
predictable_rows <- complete.cases(bpm_missing[, c("age", "sysBP", "diaBP", "cigsPerDay", "prevalentHyp")])

# Predict only on rows we can predict
if (any(predictable_rows)) {
  predicted_probs <- predict(bpm_model, newdata = bpm_missing[predictable_rows, ], type = "response")
  
  # Get the original row numbers for the missing data
  missing_row_indices <- which(is.na(fhs$BPMeds_original))
  
  # Find indices that are predictable
  predictable_indices <- missing_row_indices[predictable_rows]
  
  # Impute predicted values into the correct rows
  fhs$BPMeds[predictable_indices] <- ifelse(predicted_probs >= 0.5, 1, 0)
}


# Make sure fhs$BPMeds and fhs$BPMeds_original are there

# Create a long-format data frame
bpm_plot_data <- data.frame(
  value = c(fhs$BPMeds_original, fhs$BPMeds),
  status = rep(c("Before Imputation", "After Imputation"), each = nrow(fhs))
)

# Clean bar plot
ggplot(bpm_plot_data, aes(x = factor(value), fill = status)) +
  geom_bar(position = "dodge") +
  labs(title = "BPMeds Before and After Imputation",
       x = "BPMeds (0 = No, 1 = Yes)",
       y = "Count",
       fill = "Status") +
  scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
  theme_minimal(base_size = 14)

```

Figure G. Side by side bar chart of before and after logistic imputation.

The distribution of BPMeds before and after logistic imputation is very similar, indicating that the imputation process preserved the underlying structure of the data without introducing substantial bias.

```{r}

# Remove BPMeds_original column after imputation and plotting
fhs$BPMeds_original <- NULL

```


# FEATURE SELECTION


## Feature Selection

To reduce the number of variables included in pairwise relationship plots, feature selection was performed using a Random Forest regression model. The goal is to identify the most important numerical predictors of TenYearCHD, so that only the most relevant variables are examined in scatterplots and other visualizations.


```{r}


# Subset numeric predictors
numeric_vars <- sapply(fhs, is.numeric)
fhs_numeric <- fhs[, numeric_vars]

# Make sure TenYearCHD is included
fhs_numeric$TenYearCHD <- fhs$TenYearCHD  

# Random Forest for numeric features
library(randomForest)
rf_numeric <- randomForest(TenYearCHD ~ ., data = fhs_numeric, importance = TRUE)

# Extract importance
importance_numeric <- importance(rf_numeric)

# Create importance dataframe based on IncNodePurity
importance_numeric_df <- data.frame(
  Variable = rownames(importance_numeric),
  Importance = importance_numeric[, "IncNodePurity"]
)

# Sort by importance
importance_numeric_df <- importance_numeric_df[order(-importance_numeric_df$Importance), ]

# View results
print(importance_numeric_df)

```

Table 2. Most important numeric features. 


Using a Random Forest regression model, feature importance was evaluated for the numerical predictors of TenYearCHD. The results indicate that systolic blood pressure (sysBP), body mass index (BMI), and total cholesterol (totChol) are the most influential variables in predicting ten-year coronary heart disease risk. Other notable predictors include glucose, age, diastolic blood pressure (diaBP), and heart rate, while cigarettes per day (cigsPerDay) had the lowest relative importance among the numerical features considered.




To identify the most important categorical predictors of TenYearCHD, a Random Forest classification model was fit using only the categorical variables in the dataset. This feature selection step helps focus pairwise analyses on the most relevant categorical factors, improving the clarity and interpretability of the visualizations.


```{r}

# Subset only categorical variables
factor_vars <- sapply(fhs, is.factor)
fhs_factor <- fhs[, factor_vars]


# Check for missing values in the fhs_factor dataset
#colSums(is.na(fhs_factor))

# Remove BPMeds_original if it exists
fhs_factor$BPMeds_original <- NULL

# Ensure the target variable TenYearCHD is a factor in the original data
fhs$TenYearCHD <- as.factor(fhs$TenYearCHD)

# Add TenYearCHD back to the factor dataset if it was dropped
fhs_factor$TenYearCHD <- fhs$TenYearCHD

# Train Random Forest classifier on categorical features
rf_factor <- randomForest(TenYearCHD ~ ., data = fhs_factor, importance = TRUE)

# Get importance values
importance_factor <- importance(rf_factor)

# Convert to a clean data frame
importance_factor_df <- data.frame(
  Variable = rownames(importance_factor),
  Importance = importance_factor[, "MeanDecreaseGini"]
)

# Sort by importance
importance_factor_df <- importance_factor_df[order(-importance_factor_df$Importance), ]

# Print
print(importance_factor_df)

```

Table 3. Most important categorical features.


Using a Random Forest classification model, feature importance was evaluated for the categorical predictors of TenYearCHD. The results show that prevalent hypertension (prevalentHyp) is by far the most influential categorical variable, followed by diabetes and sex. Other categorical features such as prevalent stroke, use of blood pressure medication (BPMeds), and current smoking status had lower relative importance in predicting ten-year coronary heart disease risk.



# PAIRWISE RELATIONSHIPS

To explore the relationships between the most important numerical predictors of TenYearCHD, pairwise scatterplots are generated for the top five variables identified through Random Forest feature selection. These include systolic blood pressure (sysBP), body mass index (BMI), total cholesterol (totChol), glucose, and age. Analyzing these pairwise relationships helps to visually understand how these features correlate with each other and with the target variable, TenYearCHD. 


```{r}

# Select the top 5 most important numerical features
top_features <- fhs[, c("sysBP", "BMI", "totChol", "glucose", "age")]

# Compute the correlation matrix
cor_matrix <- cor(top_features)

# Print the correlation matrix with numbers
#print(cor_matrix)

# Create a clean correlation plot with numbers displayed inside the cells
corrplot(cor_matrix, 
         method = "color",         # Color-based visualization
         type = "full",            # Full matrix display
         col = colorRampPalette(c("red", "white", "blue"))(200), # Color gradient
         tl.cex = 0.8,             # Adjust label size
         number.cex = 0.8,         # Adjust correlation number size
         addCoef.col = "black",    # Add correlation coefficients in black
         title = "Correlation Matrix for Top 5 Numerical Features", 
         mar = c(0,0,1,0))         # Margins around the plot
```

Figure H. Correlation matrix for continuous features. 



The scatterplot matrix below displays pairwise relationships between the five most important numerical features identified for predicting coronary heart disease: systolic blood pressure (sysBP), body mass index (BMI), total cholesterol (totChol), glucose, and age. Each plot shows the joint distribution of data points for two features. Randomly subsampling 30% of the observations reduces overplotting, allowing clearer insight into potential patterns and associations.

```{r}


# Load necessary libraries
library(ggplot2)
library(gridExtra)

# Select the top 5 most important numerical features
top_features <- fhs[, c("sysBP", "BMI", "totChol", "glucose", "age")]

# Randomly subsample 30% of the rows (without replacement) to declutter the scatterplots
set.seed(123)  # For reproducibility
top_features_sample <- top_features[sample(1:nrow(top_features), 0.3 * nrow(top_features)), ]

# Create the scatterplot matrix manually
pairs_list <- list()

# Loop over all pairs of features to create individual scatterplots
for(i in 1:(ncol(top_features_sample)-1)) {
  for(j in (i+1):ncol(top_features_sample)) {
    pairs_list[[paste(names(top_features_sample)[i], names(top_features_sample)[j], sep = "_")]] <- 
      ggplot(top_features_sample,
             aes(x = .data[[names(top_features_sample)[i]]],
                 y = .data[[names(top_features_sample)[j]]])) +
      geom_point(alpha = 0.3, size = 0.5) + 
      labs(title = paste(names(top_features_sample)[i], "vs", names(top_features_sample)[j])) +
      theme_minimal() +
      theme(
        plot.title = element_text(size = 10, face = "bold", hjust = 0.5, vjust = 0),
        axis.title = element_text(size = 8),
        axis.text = element_text(size = 7)
      )
  }
}

# Arrange all scatterplots in a grid layout
grid.arrange(grobs = pairs_list, ncol = 5)  # Arrange in 5 columns, adjust ncol if needed

```

Figure I. Scatterplots of the five most important numeric features. 


The pairwise scatterplots reveal several moderate linear relationships among the top features. Specifically, systolic blood pressure (sysBP) and body mass index (BMI) exhibit a moderate linear relationship, as do sysBP and total cholesterol (totChol), BMI and totChol, sysBP and age, and totChol and age. These relationships suggest that as one variable increases, there tends to be a corresponding increase in the other, indicating a moderate association between these key health factors.




```{r}
# Set up the plotting area to have 2 rows and 3 columns
par(mfrow = c(2, 3))

# List of categorical variables to compare with TenYearCHD
categorical_vars <- c("sex", "currentSmoker", "BPMeds", "prevalentStroke", "prevalentHyp", "diabetes")

# Loop through each categorical variable
for (var in categorical_vars) {
  # Create mosaic plot for each variable compared with TenYearCHD
  mosaicplot(table(fhs[[var]], fhs$TenYearCHD), main = paste("TenYearCHD vs", var), 
             color = c("lightblue", "pink"), 
             xlab = paste(var), ylab = "TenYearCHD",
             las = 1)
  # Add labels for 1 = Yes, 0 = No
  text(x = 1, y = 1, labels = "1 = Yes", col = "black", pos = 4)
  text(x = 1, y = 2, labels = "0 = No", col = "black", pos = 4)
}



```

Figure J. Mosaic plots of relationships of categorical features. 


The mosaic plots suggest an association between TenYearCHD and each of the categorical variables except “currentSmoker.” For “sex,” “BPMeds,” “prevalentStroke,” “prevalentHyp,” and “diabetes,” the distribution of TenYearCHD clearly differs across levels. In contrast, “currentSmoker” shows little difference across levels, suggesting that smoking status alone may not be a strong predictor of TenYearCHD in this dataset.
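The associations read off the mosaic plots could be checked formally with a chi-square test of independence. A minimal sketch follows; the counts in the table are hypothetical, not taken from the fhs data.

```{r}
# Sketch of a formal check for the associations suggested by the mosaic
# plots. The contingency counts below are hypothetical, for illustration.
tab <- matrix(c(1800, 200,    # no hypertension: no CHD / CHD
                 900, 300),   # hypertension:    no CHD / CHD
              nrow = 2, byrow = TRUE,
              dimnames = list(prevalentHyp = c("0", "1"),
                              TenYearCHD   = c("0", "1")))

chisq.test(tab)   # a small p-value would indicate an association
```

The same call applied to `table(fhs$prevalentHyp, fhs$TenYearCHD)` would test the association in the actual data.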



# FEATURE ENGINEERING

To enhance model performance and uncover meaningful patterns, feature engineering was applied to the Framingham Heart Study dataset. This process included creating new variables, such as clusters, interaction terms, and clinically relevant risk indicators, to better capture complex relationships with ten-year cardiovascular disease risk.



```{r}

# Keep only numeric columns (factors, including the response, are excluded)
fhs_numeric <- fhs[sapply(fhs, is.numeric)]
fhs_scaled <- scale(fhs_numeric)  # Standardize the features

set.seed(123)

# Compute WSS for k = 1 to 10
wss <- sapply(1:10, function(k){
  kmeans(fhs_scaled, centers = k, nstart = 10)$tot.withinss
})

# Plot Elbow Curve
plot(1:10, wss, type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares",
     main = "Elbow Method for Determining Optimal K")


# Run k-means with chosen number of clusters (e.g., 3)
kmeans_result <- kmeans(fhs_scaled, centers = 3, nstart = 25)

# Add cluster to the dataset
fhs$Cluster <- as.factor(kmeans_result$cluster)

# Run PCA just for plotting (2D representation)
pca_result <- prcomp(fhs_scaled)

# Create data frame for plotting
plot_data <- data.frame(PC1 = pca_result$x[,1],
                        PC2 = pca_result$x[,2],
                        Cluster = fhs$Cluster)

# Plot the clusters
library(ggplot2)
ggplot(plot_data, aes(x = PC1, y = PC2, color = Cluster)) +
  geom_point(size = 2, alpha = 0.8) +
  labs(title = "K-Means Clustering (Visualized via PCA)",
       x = "Principal Component 1",
       y = "Principal Component 2") +
  theme_minimal() +
  scale_color_brewer(palette = "Set1")



```

Figure K. Visualization of clusters. 


The cluster analysis revealed three well-separated groups within the Framingham Heart dataset, suggesting meaningful underlying structure. These clusters may represent different cardiovascular risk profiles, which could help inform targeted prevention or intervention strategies.



# METHODOLOGY: Predictive Modeling Approaches

Previous methods, such as regression models (including regularized versions like Lasso and Ridge) and SVMs, were used to address the formulated questions; they excel with linear relationships and smaller datasets but can struggle with non-linear data or high-dimensional feature spaces.

CART, Bagging, and Random Forests offer advantages by capturing complex, non-linear relationships without the need for linear assumptions. Random Forests, in particular, improve upon Bagging by constructing multiple decision trees and averaging their outputs, which helps to further reduce variance and avoid overfitting. While CART is interpretable, it may overfit without proper pruning. Bagging helps reduce variance by combining multiple models, and Random Forests further improve performance with ensemble learning. Each method has its strengths: regression models and SVMs excel at handling smaller, simpler datasets with linear or regularized relationships, while CART, Bagging, and Random Forests shine with larger datasets and more complex interactions. The trade-off between interpretability and performance is a key consideration when selecting the appropriate model for a given task.
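The variance-reduction argument for Bagging and Random Forests can be sketched with a toy simulation. The numbers below are illustrative only: each "tree" is treated as an independent noisy estimate of the same quantity, so averaging B of them shrinks the prediction variance roughly by a factor of 1/B.

```{r}
# Toy simulation of why averaging (as in Bagging) reduces variance
set.seed(123)
B <- 100          # number of bagged estimates
n_rep <- 2000     # repetitions used to estimate the variances
target <- 5

single <- rnorm(n_rep, mean = target, sd = 2)                      # one noisy estimate
bagged <- replicate(n_rep, mean(rnorm(B, mean = target, sd = 2)))  # average of B estimates

var(single)   # close to 4
var(bagged)   # close to 4 / B
```

Real bagged trees are correlated because they share training data, so the reduction is smaller than 1/B in practice; Random Forests decorrelate the trees further by restricting the candidate variables considered at each split.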


# CART REGRESSION

CART (Classification and Regression Trees) regression is a decision tree algorithm used to predict continuous outcomes, such as total cholesterol (totChol) in the Framingham Heart Study dataset. The model development follows these key components:

1. Splitting Criterion: The tree recursively partitions the data by choosing splits that minimize the sum of squared errors (SSE) within the resulting nodes, aiming to create homogeneous groups based on predictors like age, BMI, and blood pressure.
2. Tree Structure and Stopping Rules: To prevent the tree from growing too complex, stopping rules such as the minimum number of observations required to split a node (minsplit), the minimum size of terminal nodes (minbucket), and the maximum tree depth (maxdepth) are applied.
3. Pruning: After the full tree is grown, cost-complexity pruning is used to avoid overfitting. This involves evaluating subtrees using cross-validation and selecting the optimal complexity parameter (cp) that balances model accuracy with simplicity.
4. Interpretability: The final tree structure provides a transparent model showing how different combinations of predictors lead to different predicted cholesterol levels.
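The SSE splitting criterion in the first component can be illustrated with a small hand-worked example. The cholesterol values and the split point below are hypothetical, chosen only to show the arithmetic.

```{r}
# Toy illustration of the CART regression splitting criterion:
# a split is scored by the drop in sum of squared errors (SSE)
# relative to the parent node.
sse <- function(y) sum((y - mean(y))^2)

# Hypothetical cholesterol values, sorted by some predictor (e.g. age)
y <- c(180, 185, 190, 240, 250, 260)

# Candidate split after the 3rd observation
left  <- y[1:3]
right <- y[4:6]

sse_parent   <- sse(y)                    # error with no split
sse_children <- sse(left) + sse(right)    # error after the split
gain         <- sse_parent - sse_children # reduction achieved by the split

cat("SSE reduction:", gain, "\n")
```

In practice rpart evaluates this reduction for every candidate split point of every predictor and chooses the split with the largest gain.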

## Initial Regression Tree

An initial regression tree was built to explore how various health and lifestyle factors predict total cholesterol levels in the Framingham dataset. By setting relatively flexible tree parameters, the model was able to identify key variables and potential nonlinear relationships through recursive partitioning.

```{r}

# Set seed for reproducibility
set.seed(123)


# Split the data into training (70%) and testing (30%) sets
trainIndex <- createDataPartition(fhs$TenYearCHD, p = 0.7, list = FALSE)
train_data <- fhs[trainIndex, ]
test.data <- fhs[-trainIndex, ]


# Build the initial CART regression tree for predicting total cholesterol
tree_model <- rpart(
  totChol ~ ., 
  data = train_data,
  method = "anova",  # Regression tree
  control = rpart.control(
    minsplit = 10,         # Minimum number of observations required to attempt a split
    minbucket = 5,         # Minimum number of observations in terminal nodes
    cp = 0.002,             # Complexity parameter 
    maxdepth = 5           # Maximum depth of the tree
  )
)

# Visualize the unpruned regression tree
rpart.plot(tree_model, main = "Initial CART Regression Tree for Total Cholesterol")


```

Figure L. Initial regression tree before pruning. 


## Hyperparameter Tuning

To improve model performance and prevent overfitting, we performed hyperparameter tuning on the CART regression model using 10-fold cross-validation. Specifically, we tuned the complexity parameter (cp), which controls the trade-off between tree size and predictive accuracy.


```{r}
# Set seed for reproducibility
set.seed(123)

# Build the initial regression tree with cross-validation
tree_model <- rpart(
  totChol ~ ., 
  data = train_data,
  method = "anova",  # Regression tree
  control = rpart.control(
    minsplit = 10,
    minbucket = 5,
    cp = 0.01,
    maxdepth = 5,
    xval = 10
  )
)

# View the complexity parameter table
pander(tree_model$cptable)

# Plot the complexity parameter vs. error (xerror)
plotcp(tree_model)


# Choose the optimal cp based on xerror
min_xerror <- min(tree_model$cptable[, "xerror"])
best_cp_row <- which.min(tree_model$cptable[, "xerror"])
best_cp <- tree_model$cptable[best_cp_row, "CP"]

# Print the best cp value
cat("Best CP: ", round(best_cp, 4), "\n")


```
Table 4. Cross-validation results for different complexity parameters. 

Figure M. Plot illustrates the error curve.

Although the complexity plot shows an elbow at cp = 0.051, indicating a point where further splits begin to yield diminishing returns, cross-validation identified cp = 0.01 as the optimal value for minimizing prediction error while maintaining reasonable model complexity. More specifically, the plot indicates that cp = 0.012 is best when using the 1-standard-error (1-SE) rule.
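The 1-SE selection used here can be sketched on a hypothetical cp table (the values below are made up, not the model's actual cptable): choose the largest cp whose cross-validation error is within one standard error of the minimum.

```{r}
# Sketch of the 1-SE rule on a hypothetical cp table
cp_table <- data.frame(
  CP     = c(0.10, 0.05, 0.02, 0.01),
  xerror = c(1.00, 0.95, 0.90, 0.91),
  xstd   = c(0.03, 0.03, 0.03, 0.03)
)

min_row   <- which.min(cp_table$xerror)
threshold <- cp_table$xerror[min_row] + cp_table$xstd[min_row]  # 0.90 + 0.03

# Simplest (largest-cp) tree whose xerror is still under the threshold
cp_1se <- max(cp_table$CP[cp_table$xerror <= threshold])
cp_1se   # 0.02 here: 0.95 exceeds 0.93, but 0.90 and 0.91 do not
```

The rule deliberately prefers a simpler tree whose error is statistically indistinguishable from the minimum.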


## Pruned Regression Tree: Best CP (1-SE) 


This code selects complexity parameters (cp) for pruning the regression tree based on both the minimum cross-validation error and the 1-SE rule. It then prunes the tree and makes predictions on the test set, saving the R-squared and RMSE values for later comparison across regression models.

```{r}
# Step 1: Identify the minimum xerror and its corresponding cp (for pruning with minimum error)
cp.table <- tree_model$cptable
min.xerror <- min(cp.table[, "xerror"])
min.cp.row <- which.min(cp.table[, "xerror"])
min.cp <- cp.table[min.cp.row, "CP"]

# Step 2: Compute the standard error of the minimum xerror
xerror.std <- cp.table[min.cp.row, "xstd"]
threshold <- min.xerror + xerror.std

# Step 3: Set best.cp manually per the 1-SE rule (largest cp with xerror below the threshold)
best.cp <- 0.012

# Step 4: Prune the tree using both min.cp and manually specified best.cp
pruned.tree.best.cp <- prune(tree_model, cp = best.cp)
pruned.tree.min.cp <- prune(tree_model, cp = min.cp)

# Step 5: Visualize the manually pruned tree
rpart.plot(pruned.tree.best.cp, 
           main = paste("Pruned Tree Best CP (1-SE): cp =", round(best.cp, 5)))



# Step 6: Make predictions on test data
pred.best.cp <- predict(pruned.tree.best.cp, newdata = test.data)
pred.min.cp  <- predict(pruned.tree.min.cp, newdata = test.data)

# Step 7: Save R-squared values
rsquared.bestcp <- cor(test.data$totChol, pred.best.cp)^2

# Save RMSE for best.cp (1-SE rule)
rmse.bestcp <- sqrt(mean((pred.best.cp - test.data$totChol)^2))



```

Figure N. Pruned Tree Best CP using 1-SE Rule. 


 



## Pruned Regression Tree: Minimum CP

The following tree represents the pruned regression model using the complexity parameter (CP) that minimizes cross-validation error. This version of the tree (Minimum CP) aims to balance model complexity and predictive accuracy by fitting as closely as possible to the training data without overfitting.



```{r}
# Visualize the pruned tree using the minimum CP value
rpart.plot(pruned.tree.min.cp,
           main = paste("Pruned Regression Tree (Min CP): cp =", round(min.cp, 4)),
           type = 2,                # Label all nodes with predicted values
           extra = 101,             # Show fitted values and % of observations at each node
           under = TRUE,            # Display the split condition underneath the node
           faclen = 0,              # Avoid abbreviating factor levels
           cex = 0.7,               # Shrink text for better fit
           fallen.leaves = TRUE,    # Better layout for the leaves at the bottom
           box.palette = "GnBu",    # Green-blue color theme for boxes
           shadow.col = "gray")     # Add shadows to boxes for depth

# R-squared for min.cp
rsquared.mincp <- cor(test.data$totChol, pred.min.cp)^2
# Save RMSE for min.cp (minimum xerror)
rmse.mincp <- sqrt(mean((pred.min.cp - test.data$totChol)^2))
```
Figure O. Pruned Regression Tree, Minimum CP.

The final CART regression model predicting total cholesterol identified sex and cigarettes per day as the most important predictors. This suggests that cholesterol levels in the Framingham dataset are most effectively segmented by gender and smoking behavior.


## BAGGING CART Regression

To further improve predictive performance, we next apply Bagging (Bootstrap Aggregating) to CART regression. Bagging builds multiple trees on different bootstrap samples of the data and averages their predictions, helping to reduce variance and improve model stability compared to a single decision tree.

To optimize the Bagging CART regression model, we perform a systematic hyperparameter tuning process. We split the data into training and testing sets and explore different combinations of the number of bagged trees, complexity parameter (cp), and maximum tree depth, identifying the settings that minimize the out-of-bag (OOB) error.


```{r}

# Split the data into training and testing sets
set.seed(123)
train.index <- createDataPartition(fhs$totChol, p = 0.8, list = FALSE)
train.data <- fhs[train.index, ]
test.data <- fhs[-train.index, ]

# Set up train control for cross-validation
ctrl <- trainControl(
  method = "cv",
  number = 5,
  verboseIter = TRUE
)

# Define parameter combinations to test
nbagg.values <- c(10, 25, 50)     # number of bagged trees
cp.values <- c(0.01, 0.05, 0.1)   # candidate cp values for rpart
maxdepth.values <- c(5, 10, 20)   # maximum depth of the candidate tree

# Create an empty data frame to store results
results <- data.frame()

# Model tuning loop
for (nbagg in nbagg.values) {
  for (cp in cp.values) {
    for (maxdepth in maxdepth.values) {
      set.seed(123)
      model <- bagging(
        totChol ~ .,
        data = train.data,
        nbagg = nbagg,
        coob = TRUE,   # estimate error out-of-bag; ipred::bagging does not use a caret trainControl
        control = rpart.control(cp = cp, 
                                maxdepth = maxdepth)
      )
      # Get OOB error from each iteration
      oob.error <- model$err
      # Store results
      results <- rbind(results, 
                       data.frame(nbagg = nbagg,
                                  cp = cp,
                                  maxdepth = maxdepth,
                                  oob.error = oob.error))
    }
  }
}

# Find the best combination that yields the minimum out-of-bag error
best.params <- results[which.min(results$oob.error), ]
pander(best.params)



```
Table 5. Best hyperparameter combination for Bagging CART regression, showing 50 trees, a complexity parameter (cp) of 0.01, and a maximum tree depth of 5, resulting in the lowest out-of-bag error (39.09).





The final Bagging CART model is fitted to the training data using the optimal hyperparameters from the tuning process. Predictions are then generated on the test set for later evaluation.

```{r}

# Retrain the final model using the best parameters
set.seed(123)
final.bagging.model <- bagging(
  totChol ~ .,
  data = train.data,
  nbagg = best.params$nbagg,
  coob = TRUE,
  control = rpart.control(cp = best.params$cp, maxdepth = best.params$maxdepth)
)

# Make predictions on test set
pred.bagging <- predict(final.bagging.model, newdata = test.data)

# Calculate RMSE and R-squared
rmse.bagging <- sqrt(mean((pred.bagging - test.data$totChol)^2))

ss.res <- sum((pred.bagging - test.data$totChol)^2)
ss.tot <- sum((mean(train.data$totChol) - test.data$totChol)^2)
rsq.bagging <- 1 - (ss.res / ss.tot)

```
 
 
In the next step, the importance of each predictor in the final Bagging model is evaluated. This analysis helps identify the features that have the most significant influence on predicting the target variable, offering insights into the model’s decision-making process. 
 
 
```{r}

# Custom function to get variable importance from bagging model
get.bagging.importance <- function(model) {
  # Get all the trees from the bagging model
  trees <- model$mtrees
  
  # Initialize importance vector
  imp <- numeric(length(trees[[1]]$btree$variable.importance))
  names(imp) <- names(trees[[1]]$btree$variable.importance)
  
  # Sum importance across all trees
  for (tree in trees) {
    imp[names(tree$btree$variable.importance)] <- 
      imp[names(tree$btree$variable.importance)] + 
      tree$btree$variable.importance
  }
  
  # Average importance
  imp <- imp / length(trees)
  return(imp)
}

# Get variable importance from the final bagging model
importance.scores <- get.bagging.importance(final.bagging.model)

# Display variable importance
#print(importance.scores)


# Set custom margins: bottom, left, top, right
par(mar = c(5, 8, 4, 2))  # increases left margin for labels


# Sort and plot
importance.scores <- sort(importance.scores, decreasing = TRUE)
barplot(importance.scores, horiz = TRUE, las = 1,
        main = "Variable Importance - Bagging (ipred)",
        xlab = "Importance Score")
 
 
```

Figure P. The bar chart displays the relative importance of each predictor variable in the final Bagging model, with the variables ordered by their contribution to the model’s predictions.



## Random Forest Regression

Random Forest regression is a powerful ensemble learning technique that aggregates predictions from multiple decision trees to improve accuracy and robustness. In the process, hyperparameters such as the number of trees, the number of variables tried at each split, and the minimum node size play a crucial role in determining the model’s performance. The next step involves tuning these hyperparameters to optimize the Random Forest model for predicting total cholesterol (totChol) using cross-validation and evaluating the resulting performance.


Hyperparameter tuning for the Random Forest model involves testing different combinations of parameters, including mtry, ntree, and nodesize, to identify the optimal configuration based on RMSE performance.

```{r}
# Hyperparameter Tuning for Random Forest
tune_grid <- expand.grid(
  mtry = c(2, 4, 6, 8),
  ntree = c(100, 200),
  nodesize = c(1, 5)
)

tune_results <- list()

for (i in 1:nrow(tune_grid)) {
  rf_temp <- randomForest(
    totChol ~ .,
    data = train.data,
    mtry = tune_grid$mtry[i],
    ntree = tune_grid$ntree[i],
    nodesize = tune_grid$nodesize[i]
  )
  preds <- predict(rf_temp, newdata = test.data)
  rmse <- sqrt(mean((test.data$totChol - preds)^2))
  tune_results[[i]] <- c(
    rmse = rmse,
    mtry = tune_grid$mtry[i],
    ntree = tune_grid$ntree[i],
    nodesize = tune_grid$nodesize[i]
  )
}

# Create data frame and find best parameters
tune_df <- as.data.frame(do.call(rbind, tune_results))
best_params <- tune_df[which.min(tune_df$rmse), ]


```



After tuning the Random Forest model, a separate Bagging model is built by setting the number of predictors considered at each split equal to the total number of available predictors. This approach emphasizes reducing variance by fully growing each tree and averaging their predictions for more stable results.


```{r}

set.seed(123)

# Bagging model: mtry = total number of predictors
bagging_model <- randomForest(
  totChol ~ ., 
  data = train.data, 
  mtry = ncol(train.data) - 1,  # Exclude target variable
  importance = TRUE
)

# Predict on test data
bagging_pred <- predict(bagging_model, newdata = test.data)

# Performance: RMSE
bagging_rmse <- sqrt(mean((test.data$totChol - bagging_pred)^2))
#print(paste("Bagging RMSE:", round(bagging_rmse, 2)))

```


Using the best hyperparameters, the final Random Forest model is trained on the full training set. Predictions are made on the test data, and model performance is assessed through RMSE and R-squared values.


```{r}
# Final Random Forest Model with Best Params
#-------------------------------------------
final_rf <- randomForest(
  totChol ~ ., 
  data = train.data,
  mtry = best_params$mtry,
  ntree = best_params$ntree,
  nodesize = best_params$nodesize,
  importance = TRUE
)

# Predict on test data
rf_pred <- predict(final_rf, newdata = test.data)

# Calculate RMSE
rmse.rforest <- sqrt(mean((test.data$totChol - rf_pred)^2))

# Calculate R-squared
ss.res <- sum((test.data$totChol - rf_pred)^2)
ss.tot <- sum((mean(train.data$totChol) - test.data$totChol)^2)
rsq.rforest <- 1 - (ss.res / ss.tot)

print(final_rf)

```
Table 6. Summary of the final Random Forest model, displaying the number of trees, selected hyperparameters, out-of-bag (OOB) error estimate, and variable importance measures.



Variable importance is evaluated for the final Random Forest model to identify the predictors that contribute most to the prediction of total cholesterol.

```{r}
# Variable importance
importance(final_rf)
varImpPlot(final_rf, main = "Random Forest Variable Importance")

```
Table 7. Variable importance table. 

Figure Q. Variable importance plot. 

The variable importance table presents two key metrics: %IncMSE, which measures the increase in mean squared error when a variable is randomly permuted, and IncNodePurity, which reflects the total decrease in node impurity due to splits on each variable. Higher values in either metric indicate greater influence on the model’s predictive accuracy. 

The accompanying Random Forest Variable Importance Plot visually ranks the predictors based on these scores, providing an intuitive way to identify the most impactful features in the model’s performance.
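The permutation idea behind %IncMSE can be sketched outside the forest setting. The example below uses simulated data and an ordinary lm() fit (a simplifying assumption), but the logic is the same: permute one predictor, re-predict, and measure how much the error grows.

```{r}
# Concept sketch of permutation importance (the idea behind %IncMSE):
# permuting an informative predictor should raise the error noticeably;
# permuting a noise predictor should barely change it.
set.seed(123)
n <- 500
x1 <- rnorm(n)               # informative predictor
x2 <- rnorm(n)               # pure noise
y  <- 3 * x1 + rnorm(n)

fit <- lm(y ~ x1 + x2)
base_mse <- mean((y - predict(fit))^2)

perm_mse <- function(var) {
  d <- data.frame(x1 = x1, x2 = x2)
  d[[var]] <- sample(d[[var]])   # break this variable's link to y
  mean((y - predict(fit, newdata = d))^2)
}

perm_mse("x1") - base_mse   # large increase: x1 matters
perm_mse("x2") - base_mse   # near zero: x2 does not
```

randomForest computes the same quantity tree by tree on each tree's out-of-bag observations and averages the increases.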







## Comparison: Model Performance

In this analysis, various regression models are evaluated to predict total cholesterol (totChol) using the Framingham Heart Study dataset. The models’ performance is assessed through key metrics such as Root Mean Squared Error (RMSE) and R-squared, with the goal of determining which model most accurately captures the underlying patterns in the data.

```{r}

#--- Assumes these values are already computed and stored ---
# rmse.bestcp, rmse.mincp
# rsquared.bestcp, rsquared.mincp
# rmse.bagging, rsq.bagging
# rmse.rforest, rsq.rforest

# Calculate LSE model (Stepwise AIC)
model.lse <- step(lm(totChol ~ ., data = train.data), direction = "both", trace = FALSE)
pred.lse <- predict(model.lse, newdata = test.data)

# Compute metrics for LSE model
rmse.lse <- sqrt(mean((test.data$totChol - pred.lse)^2))
ss.res.lse <- sum((test.data$totChol - pred.lse)^2)
ss.tot.lse <- sum((mean(train.data$totChol) - test.data$totChol)^2)
rsq.lse <- 1 - (ss.res.lse / ss.tot.lse)

#--- Create summary table ---
model_results <- data.frame(
  Model = c("Pruned Tree (min cp)", 
            "Pruned Tree (1-SE cp)", 
            "Bagging", 
            "Random Forest", 
            "Stepwise AIC (LSE)"),
  RMSE = c(rmse.mincp, 
          rmse.bestcp, 
          rmse.bagging, 
          rmse.rforest, 
          rmse.lse),
  R_squared = c(rsquared.mincp, 
                rsquared.bestcp, 
                rsq.bagging, 
                rsq.rforest, 
                rsq.lse)
)

# View comparison
print(model_results)

```

Table 8. RMSE and R-squared values for the four tree-based models, with Stepwise AIC (LSE) as the baseline for comparison. 


Random Forest outperformed the other regression models, achieving the highest R-squared and the lowest RMSE, indicating its superior ability to capture the relationships within the data. Pruned Tree (min cp) followed closely behind, showing competitive performance but slightly lower predictive accuracy compared to Random Forest.




# CART CLASSIFICATION

CART (Classification and Regression Trees) classification constructs a decision tree that splits the data into increasingly homogeneous groups based on a categorical outcome, such as the presence or absence of 10-year coronary heart disease (TenYearCHD) in the Framingham Heart Study dataset. This approach identifies the most important predictors and optimal split points to best classify individuals according to their risk of developing CHD within ten years. The key components for this analysis are listed below.

1) Splitting Criterion: Uses measures like Gini impurity or entropy to choose the best variable and split point.
2) Stopping Rules: Parameters such as minsplit, minbucket, and maxdepth control tree growth and prevent overfitting.
3) Pruning: Reduces tree complexity using the complexity parameter (cp) and cross-validation to select an optimal subtree.
4) Prediction and Evaluation: The final tree is used to predict outcomes on test data, typically evaluated using accuracy, sensitivity, specificity, and AUC.
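The Gini impurity criterion in the first component can be illustrated with a small hand-worked example. The class proportions and node weights below are hypothetical, chosen only to show the arithmetic.

```{r}
# Toy illustration of the Gini impurity criterion used by CART
# classification: a split is scored by the weighted decrease in impurity.
gini <- function(p) 1 - sum(p^2)

# Hypothetical parent node: 70% no-CHD, 30% CHD
g_parent <- gini(c(0.7, 0.3))

# Hypothetical children after splitting on, say, prevalentHyp
g_left  <- gini(c(0.9, 0.1))            # node holding 60% of observations
g_right <- gini(c(0.4, 0.6))            # node holding 40% of observations
g_split <- 0.6 * g_left + 0.4 * g_right # weighted child impurity

cat("Impurity decrease:", g_parent - g_split, "\n")
```

rpart evaluates this decrease for every candidate split and chooses the one yielding the purest children.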


## Initial CART Classification

The CART classification model for predicting TenYearCHD is built using a subset of the Framingham Heart Study dataset. The tree is constructed with specific parameters such as limiting the depth of the tree and ensuring a minimum number of observations for splits. The resulting model represents the full, unpruned tree, which is then visualized to show the class probabilities and labels at each node, providing insights into how different variables contribute to the prediction of 10-year coronary heart disease risk.

```{r}
# Set seed for reproducibility
set.seed(123)

# Split data into training (70%) and testing (30%) sets
train_index <- sample(1:nrow(fhs), size = 0.7 * nrow(fhs))
train_data <- fhs[train_index, ]
test_data <- fhs[-train_index, ]

# Build initial CART classification model (predicting TenYearCHD)
class_model <- rpart(
  TenYearCHD ~ ., 
  data = train_data,
  method = "class",  # Classification tree
  control = rpart.control(
    minsplit = 10,
    minbucket = 5,
    cp = 0.0001,
    maxdepth = 5
  )
)

# Visualize the classification tree
rpart.plot(class_model, 
           extra = 104,             # Show probabilities and class labels
           box.palette = "GnBu",    # Green to blue gradient
           branch.lty = 1, 
           shadow.col = "gray", 
           nn = TRUE,               # Show node numbers
           main = "Initial CART Classification Tree for TenYearCHD")

```

Figure R. Initial CART Decision Tree. 


## Tuning


This step tunes the classification tree by examining complexity parameter (cp) values from the model’s cost-complexity table (cptable). It identifies the cp corresponding to the minimum cross-validation error (xerror), as well as the largest cp within one standard error of that minimum (the 1-SE rule), which selects the simplest tree that still performs near-optimally. The 1-SE rule serves as a conservative guard against overfitting; reporting both values makes explicit the trade-off between a more complex model and a simpler, potentially more generalizable one.




```{r}

pander(class_model$cptable)

min.row <- which.min(class_model$cptable[, "xerror"])
min.cp <- class_model$cptable[min.row, "CP"]
se <- class_model$cptable[min.row, "xstd"]
xerror.cutoff <- class_model$cptable[min.row, "xerror"] + se

# Now find the simplest (largest cp) tree within 1-SE range
cp.1SE <- max(class_model$cptable[class_model$cptable[, "xerror"] <= xerror.cutoff, "CP"])

cat("min.cp:", min.cp, "\n")
cat("cp.1SE:", cp.1SE, "\n")




```

Table 9. Complexity Parameter Table.

The minimum cross-validated error occurred at a complexity parameter (cp) value of 0.0102. Using the 1-SE rule, the simplest tree within one standard error of the minimum error also corresponds to the same cp value of 0.0102. This suggests that no further simplification is necessary, and the tree with cp = 0.0102 is both the most accurate and the most parsimonious choice.
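As a toy illustration of the 1-SE selection logic (the numbers below are made up and are not the Framingham output), the rule keeps the largest CP whose cross-validation error falls within one standard error of the minimum:

```{r}
# Toy cptable with made-up values, illustrating the 1-SE rule
cptab <- data.frame(
  CP     = c(0.0500, 0.0100, 0.0062, 0.0010),
  xerror = c(1.000, 0.950, 0.940, 0.945),
  xstd   = c(0.030, 0.029, 0.029, 0.029)
)

min.row.toy <- which.min(cptab$xerror)                              # row 3 (xerror = 0.940)
cutoff.toy  <- cptab$xerror[min.row.toy] + cptab$xstd[min.row.toy]  # 0.940 + 0.029 = 0.969

# Simplest (largest CP) tree whose xerror is within the cutoff
cp.1SE.toy <- max(cptab$CP[cptab$xerror <= cutoff.toy])             # 0.0100
cp.1SE.toy
```

In this toy table the minimum-error tree (CP = 0.0062) is passed over in favor of the simpler CP = 0.0100 tree, whose error is statistically indistinguishable from the minimum.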


The following plot displays the cross-validation results for the classification tree model, illustrating how the complexity parameter (cp) influences the model’s cross-validation error.


```{r}
# Plot the cross-validation results
plotcp(class_model)

```

Figure S. CART Complexity Parameter (cp) Plot. 

The cross-validation plot shows that the lowest error occurs at a tree size of 5, corresponding to a complexity parameter (cp) of 0.0062, suggesting that this tree size offers the best balance between model complexity and predictive performance. Note that plotcp labels its x-axis with the geometric means of adjacent CP values from the cptable, so the plotted cp of 0.0062 differs from, but is consistent with, the cptable entry of 0.0102 reported above.


## CART Classification (1-SE Rule)


Pruning the CART classification tree with the 1-SE rule simplifies the model by reducing its complexity while maintaining predictive performance. The approach selects the smallest tree whose cross-validation error lies within one standard error of the minimum, balancing simplicity against accuracy. Based on the cp plot, a cp value of 0.0062 is used for the 1-SE rule.

```{r}

# 1. Find the optimal cp value that minimizes cross-validated error
min.cp <- class_model$cptable[which.min(class_model$cptable[, "xerror"]), "CP"]

# 2. Manually specify the 1-SE cp value read from the cp plot (Figure S)
cp.1SE <- 0.0062

# 3. Prune the tree using the 1-SE cp value
pruned.tree.1SE <- prune(class_model, cp = cp.1SE)

# 4. Visualize the pruned tree (1-SE rule version)
rpart.plot(pruned.tree.1SE, 
           extra = 104,              # Show predicted class and probability
           box.palette = "GnBu",     # Green to blue color scheme
           branch.lty = 1, 
           shadow.col = "gray", 
           nn = TRUE, 
           main = "Pruned CART Classification Tree (1-SE Rule)")

# 5. Generate predictions for pruned tree (1-SE rule)
pred.prob.1SE <- predict(pruned.tree.1SE, test_data, type = "prob")[, 2]

# 6. Create ROC curve for pruned tree (1-SE rule)
roc.tree.1SE <- pROC::roc(test_data$TenYearCHD, pred.prob.1SE)

# 7. Access the AUC directly from the ROC object
auc.tree.1SE <- roc.tree.1SE$auc


# 8. Optionally, print the AUC
#print(paste("AUC for pruned tree (1-SE):", round(auc.tree.1SE, 4)))


```

Figure T. Pruned CART Classification tree using 1-SE Rule.




## CART Classification Tree (Min CP)

The pruned CART classification tree is visualized using the minimum complexity parameter (cp) value, which helps in reducing model complexity while retaining predictive accuracy. This visualization highlights the class probabilities and the number of observations at each node, providing a clearer understanding of the model’s decision-making process.

```{r}

# Prune the tree using the cp that gives minimum cross-validated error
pruned.tree.min <- prune(class_model, cp = min.cp)

# Visualize the pruned tree with the chosen minimum cp
rpart.plot(pruned.tree.min, 
           extra = 104,  # Displays more detailed information on nodes (class probabilities, number of observations)
           box.palette = "GnBu",  # Green to Blue color scheme for node boxes
           branch.lty = 1,  # Solid lines for branches
           shadow.col = "gray",  # Shadow color for node labels
           nn = TRUE,  # Add node numbers to the tree
           main = "Pruned CART Classification Tree (Min CP)"  # Title for the plot
)




```

Figure U. Pruned CART Classification tree using min CP.

The pruned CART classification tree using the minimum CP resulted in a single node, suggesting that this approach did not produce a meaningful or informative model.



## BAGGING CART Classification

The BAGGING (Bootstrap Aggregating) method is applied to the CART (Classification and Regression Trees) model to predict TenYearCHD, a binary outcome indicating the likelihood of coronary heart disease within ten years. BAGGING combines multiple decision trees built on bootstrapped subsets of the data, reducing variance and improving model robustness. By aggregating the predictions of individual trees, BAGGING helps mitigate overfitting, making it particularly useful for complex datasets like the Framingham Heart Study. The analysis explores the model's performance by tuning hyperparameters over a grid and evaluating the results on the test set using classification accuracy (and later AUC), ultimately identifying the optimal parameters for accurate predictions.
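The aggregation idea itself can be sketched in a few lines of base R, using a trivial threshold "stump" in place of rpart trees (a hand-rolled illustration on simulated data, not the bagging() implementation used below):

```{r}
set.seed(1)
n <- 200
x <- rnorm(n)
y <- as.integer(x + rnorm(n, sd = 0.5) > 0)   # noisy binary outcome

# A "stump" classifier: predict 1 when x exceeds a threshold
stump.predict <- function(threshold, x.new) as.integer(x.new > threshold)

B <- 25                                        # number of bootstrap models
votes <- sapply(1:B, function(b) {
  idx <- sample(n, replace = TRUE)             # bootstrap resample of the rows
  xb <- x[idx]; yb <- y[idx]
  # "Train" the stump: pick the grid threshold with the best resample accuracy
  grid <- quantile(xb, probs = seq(0.1, 0.9, 0.1))
  acc  <- sapply(grid, function(t) mean(stump.predict(t, xb) == yb))
  stump.predict(grid[which.max(acc)], x)       # predict on all observations
})

bagged.pred <- as.integer(rowMeans(votes) > 0.5)  # majority vote across stumps
mean(bagged.pred == y)                            # accuracy of the aggregated vote
```

Each column of votes holds one bootstrap model's predictions; averaging the columns and thresholding at 0.5 is exactly the majority vote that bagging performs.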


First, the BAGGING model is trained on the training data, and the predictions from the individual bootstrapped trees are aggregated to make the final classification, ensuring that each tree's prediction contributes to the overall output.

```{r}
# Set seed for reproducibility
set.seed(123)

# Ensure the target is a factor
fhs$TenYearCHD <- as.factor(fhs$TenYearCHD)

# Split data into training and testing sets
trainIndex <- createDataPartition(fhs$TenYearCHD, p = 0.7, list = FALSE)
trainData <- fhs[trainIndex, ]
testData <- fhs[-trainIndex, ]

# Create hyperparameter grid
hyper.grid <- expand.grid(
  nbagg = c(25, 50, 100),
  minsplit = c(5, 10, 20),
  maxdepth = c(5, 10, 20),
  cp = c(0.01, 0.001)
)

# Initialize tracking variables
results <- data.frame()
best.accuracy <- 0
best.params <- list()

# Loop through hyperparameter combinations
for(i in 1:nrow(hyper.grid)) {
  params <- hyper.grid[i, ]
  
  ctrl <- rpart.control(
    minsplit = params$minsplit,
    maxdepth = params$maxdepth,
    cp = params$cp
  )
  
  # Train Bagging model
  bag.model <- bagging(
    TenYearCHD ~ .,
    data = trainData,
    nbagg = params$nbagg,
    coob = TRUE,
    control = ctrl
  )
  
  # Predict on test set
  preds <- predict(bag.model, newdata = testData)
  
  # Accuracy
  cm <- confusionMatrix(preds, testData$TenYearCHD)
  accuracy <- cm$overall["Accuracy"]
  
  # Store results
  results <- rbind(results, data.frame(
    nbagg = params$nbagg,
    minsplit = params$minsplit,
    maxdepth = params$maxdepth,
    cp = params$cp,
    Accuracy = accuracy
  ))
  
  # Track best result
  if(accuracy > best.accuracy) {
    best.accuracy <- accuracy
    best.params <- params
  }
}

# Display best parameters
cat("Best BAGGING hyperparameters:\n")
pander(best.params)

# Optional: View full results
# pander(results)


```

Table 10. Best BAGGING Classification Hyperparameters.


The output shows the optimal hyperparameters for the BAGGING model, with 100 trees (nbagg), a minimum split of 20 observations (minsplit), a maximum depth of 20 for each tree (maxdepth), and a complexity parameter (cp) of 0.001.


Next, the final BAGGING model is trained using the best hyperparameters identified during tuning. The model is then used to generate predicted probabilities for the test data, from which the ROC curve and AUC are computed for the model-comparison section.

```{r}
# Set rpart control using best parameters from tuning
best.control <- rpart.control(
  minsplit = best.params$minsplit,
  maxdepth = best.params$maxdepth,
  cp = best.params$cp
)

# Train the final Bagging model on training data
final.bag.model <- bagging(
  TenYearCHD ~ .,
  data = trainData,
  nbagg = best.params$nbagg,
  coob = TRUE,
  control = best.control
)

# Generate predictions (probabilities) for the Bagging model
pred.bag.prob <- predict(final.bag.model, newdata = testData, type = "prob")[, 2]  # [ ,2] selects the probability for the positive class (1)

# Compute ROC for Bagging model
roc.bagging <- roc(testData$TenYearCHD, pred.bag.prob)

# Compute AUC for Bagging model
auc.bagging <- roc.bagging$auc

# Plot ROC curve for the Bagging model
#plot(roc.bagging, main = "ROC Curve - Bagging", col = "blue", lwd = 2)

# Now proceed with comparing other models and plotting as needed


```



## Random Forest Classification

Random Forest classification is an ensemble learning method that builds multiple decision trees and aggregates their predictions to improve classification accuracy. Each tree is trained on a random subset of the data with random feature selection at each split, which helps reduce overfitting and increases the model’s robustness. The final prediction is determined by the majority vote of the individual trees, making Random Forest a powerful tool for classification tasks, especially when dealing with complex and high-dimensional data.
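The two sources of randomness can be made visible with a small base-R sketch (illustration only, using variable names from the Framingham subset; randomForest() performs both steps internally):

```{r}
set.seed(123)
predictors <- c("age", "sysBP", "diaBP", "totChol", "glucose", "BMI",
                "cigsPerDay", "heartRate")
n <- 100     # pretend training-set size
mtry <- 3    # number of predictors considered at each split

boot.rows  <- sample(n, replace = TRUE)   # bootstrap rows used to grow one tree
split.vars <- sample(predictors, mtry)    # candidate variables tried at one node

length(unique(boot.rows))   # roughly 63% of rows appear in a bootstrap sample
split.vars                  # a fresh subset is drawn at every split of every tree
```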

Hyperparameter tuning involves selecting the optimal settings for a model’s parameters, such as the number of trees, the maximum depth of each tree, and the minimum number of samples required for a split, to enhance its performance and accuracy. 

```{r}

set.seed(123)

# Two-way split: training/testing
train.idx <- createDataPartition(fhs$TenYearCHD, p = 0.7, list = FALSE)
train.data <- fhs[train.idx, ]
test.data <- fhs[-train.idx, ]

# Drop BPMeds_original (if present) from both sets
train.data <- train.data[, !colnames(train.data) %in% "BPMeds_original"]
test.data <- test.data[, !colnames(test.data) %in% "BPMeds_original"]

# Ensure same columns and column order
test.data <- test.data[, names(train.data)]

# Ensure outcome is a factor with levels "0", "1"
train.data$TenYearCHD <- as.factor(train.data$TenYearCHD)
test.data$TenYearCHD <- as.factor(test.data$TenYearCHD)

# Align all factor levels between training and testing sets
for (col in names(train.data)) {
  if (is.factor(train.data[[col]])) {
    test.data[[col]] <- factor(test.data[[col]], levels = levels(train.data[[col]]))
  }
}

# Cross-validation setup
k <- 5
train.size <- nrow(train.data)
fold.size <- floor(train.size / k)

# Hyperparameter grid
tune.grid <- expand.grid(
  mtry = c(2, 3, 4, 5),
  ntree = c(100, 300, 500),
  nodesize = c(1, 3, 5, 10),
  maxnodes = c(5, 10, 20)  # NA (unlimited) omitted: subset() below silently drops rows with NA comparisons
)

# Keep only valid combinations before starting
p <- ncol(train.data) - 1  # Number of predictors
tune.grid <- subset(tune.grid, mtry >= 1 & mtry <= p & nodesize < maxnodes)

# Initialize results storage
results <- data.frame()
best.auc <- 0
best.hyp.params <- NULL

# Grid search with cross-validation
for (i in 1:nrow(tune.grid)) {
  current.tune.params <- tune.grid[i, ]
  cv.auc <- numeric(k)  # Preallocate AUC vector for folds
  
  for (j in 1:k) {
    # Define training and validation sets
    cv.id <- ((j - 1) * fold.size + 1):(j * fold.size)
    cv.train <- train.data[-cv.id, ]
    cv.valid <- train.data[cv.id, ]
    
    # Train the model
    rf.cv <- randomForest(
      TenYearCHD ~ .,
      data = cv.train,
      mtry = current.tune.params$mtry,
      ntree = current.tune.params$ntree,
      nodesize = current.tune.params$nodesize,
      maxnodes = current.tune.params$maxnodes
    )
    
    # Predict probabilities for positive class '1'
    prob.cv <- predict(rf.cv, newdata = cv.valid, type = "prob")[, "1"]
    
    # Compute AUC using pROC
    roc_obj <- pROC::roc(
      response = cv.valid$TenYearCHD,
      predictor = prob.cv,
      direction = "<"
    )
    cv.auc[j] <- pROC::auc(roc_obj)
  }
  
  # Compute average AUC over folds
  avg.auc <- mean(cv.auc)
  
  # Store results
  results <- rbind(results, data.frame(
    mtry = current.tune.params$mtry,
    ntree = current.tune.params$ntree,
    nodesize = current.tune.params$nodesize,
    maxnodes = current.tune.params$maxnodes,
    auc = avg.auc
  ))
  
  # Update best if current is better
  if (avg.auc > best.auc) {
    best.auc <- avg.auc
    best.hyp.params <- current.tune.params
  }
}

# Display best parameters
pander(data.frame(cbind(best.hyp.params, best.auc)))



```

Table 11. Random Forest Hyperparameter Tuning Results Table


The output indicates the optimal hyperparameters for the Random Forest model, with mtry set to 4, ntree set to 500, nodesize set to 10, and maxnodes set to 20. These settings resulted in a best AUC of 0.6965, which reflects the model’s performance in distinguishing between the classes.



To finalize the Random Forest model, the best hyperparameters identified from tuning are used to train the model on the full training dataset. Predictions for class probabilities of the positive class (“1”) are then made for the test data. In the subsequent steps, the model’s performance will be assessed using the ROC curve and AUC, which will be presented during the comparison section.


```{r}

# Train the final Random Forest model using best parameters
final.rf.cls <- randomForest(
  TenYearCHD ~ .,
  data = train.data,
  ntree = best.hyp.params$ntree,
  mtry = best.hyp.params$mtry,
  nodesize = best.hyp.params$nodesize,
  maxnodes = best.hyp.params$maxnodes,
  importance = TRUE
)

# Predict class probabilities for the positive class "1"
pred.rf.prob <- predict(final.rf.cls, test.data, type = "prob")[, "1"]

# Ensure response is a factor with correct levels
test.response <- factor(test.data$TenYearCHD, levels = c(0, 1))

# Compute ROC and AUC
rf.roc <- pROC::roc(response = test.response, predictor = pred.rf.prob, levels = c("0", "1"), direction = "<")
rf.auc <- pROC::auc(rf.roc)

# Display AUC
#print(rf.auc)



```






The importance of each predictor variable in the final Random Forest model is assessed using two metrics: Mean Decrease in Accuracy and Mean Decrease in Gini. These measures are extracted and visualized to highlight the most influential variables, providing insights into which features contribute the most to the model’s predictive power. The results are displayed in a bar plot that separates the importance scores for each metric.

```{r}
# Extract importance measures
importance_vals <- importance(final.rf.cls)
importance_df <- data.frame(
  Variable = rownames(importance_vals),
  MeanDecreaseAccuracy = importance_vals[, "MeanDecreaseAccuracy"],
  MeanDecreaseGini = importance_vals[, "MeanDecreaseGini"]
)


# Pivot for plotting
importance_long <- pivot_longer(
  importance_df,
  cols = c(MeanDecreaseAccuracy, MeanDecreaseGini),
  names_to = "Metric",
  values_to = "Importance"
)

# Plot
ggplot(importance_long, aes(x = reorder(Variable, Importance), y = Importance, fill = Metric)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
  facet_wrap(~ Metric, scales = "free_x") +
  labs(title = "Variable Importance (Random Forest)",
       x = "Variable",
       y = "Importance Score") +
  theme_minimal() +
  theme(legend.position = "none")


```

Figure V. Random Forest Feature Importance Visualization


## Comparison 

To evaluate and compare the predictive performance of the pruned CART models, logistic regression, Bagging, and Random Forest, we generated ROC (Receiver Operating Characteristic) curves using the test dataset. These curves illustrate the trade-off between sensitivity and specificity across all possible classification thresholds, providing a visual comparison of model performance across different methods.



```{r}


# Fit the Logistic Regression Model
logit_model <- glm(TenYearCHD ~ ., data = train.data, family = binomial)

# Predict probabilities for the test set
pred.logit.prob <- predict(logit_model, newdata = test.data, type = "response")

# Ensure response is a factor with correct levels
test.response <- factor(test.data$TenYearCHD, levels = c(0, 1))

# Compute ROC and AUC for Logistic Regression
roc.logit <- roc(response = test.response, predictor = pred.logit.prob, levels = c("0", "1"), direction = "<")
auc.logit <- pROC::auc(roc.logit)


# Recompute prediction, ensure it's aligned and un-named
pred.prob.1SE <- predict(pruned.tree.1SE, newdata = test.data, type = "prob")[, 2]
pred.prob.1SE <- as.numeric(pred.prob.1SE)  # removes any row names

# ROC for each model (ensure the model probabilities are defined first)
roc.tree.1SE <- roc(test.data$TenYearCHD, pred.prob.1SE)

pred.prob.min <- predict(pruned.tree.min, newdata = test.data, type = "prob")[, 2]
pred.prob.min <- as.numeric(pred.prob.min)

roc.tree.min <- roc(test.data$TenYearCHD, pred.prob.min)
roc.bagging <- roc(testData$TenYearCHD, pred.bag.prob)  # bagging predictions were made on testData
roc.rf <- roc(test.data$TenYearCHD, pred.rf.prob)

# AUC values
auc.tree.1SE <- pROC::auc(roc.tree.1SE)
auc.tree.min <- pROC::auc(roc.tree.min)
auc.bagging <- pROC::auc(roc.bagging)
auc.rf <- pROC::auc(roc.rf)

# Plot ROC curves for comparison
par(mar = c(5, 5, 4, 1))
plot(1 - roc.logit$specificities, roc.logit$sensitivities, 
     type = "l", col = "darkorange", lwd = 2, 
     xlab = "1 - Specificity", ylab = "Sensitivity", 
     main = "ROC Curve Comparison")

lines(1 - roc.tree.1SE$specificities, roc.tree.1SE$sensitivities, col = "gold", lwd = 2)
lines(1 - roc.tree.min$specificities, roc.tree.min$sensitivities, col = "purple", lwd = 2)
lines(1 - roc.bagging$specificities, roc.bagging$sensitivities, col = "blue", lwd = 2)
lines(1 - roc.rf$specificities, roc.rf$sensitivities, col = "forestgreen", lwd = 2)

# Add a legend for clarity
legend("bottomright",
       legend = c("Logistic", "CART 1SE", "CART Min", "Random Forest", "Bagging"),
       col = c("darkorange", "gold", "purple", "blue", "forestgreen"),
       lty = c(1, 1, 1, 1, 1),
       lwd = 2,
       bty = "n",
       cex = 0.6)

### === Annotate with AUCs === ###
# Annotating the plot with the AUC values
text(0.75, 0.48, paste("Logistic AUC: ", round(auc.logit, 4)), cex = 0.7)
text(0.75, 0.42, paste("CART 1SE AUC: ", round(auc.tree.1SE, 4)), cex = 0.7)
text(0.75, 0.36, paste("CART Min AUC: ", round(auc.tree.min, 4)), cex = 0.7)
text(0.75, 0.30, paste("Random Forest AUC: ", round(auc.rf, 4)), cex = 0.7)
text(0.75, 0.24, paste("Bagging AUC: ", round(auc.bagging, 4)), cex = 0.7)

```



Figure W. ROC Curves for Classification Models. 

Receiver Operating Characteristic (ROC) Curves for Multiple Predictive Models: The plot compares the ROC curves for Logistic Regression, CART with 1-SE pruning, CART with Minimum Cost-Complexity Pruning, Bagging, and Random Forest. AUC values for each model are annotated, highlighting their classification performance. Among the models evaluated, Logistic Regression achieved the highest AUC, demonstrating superior accuracy in distinguishing between individuals who will and will not develop coronary heart disease over a ten-year period. This suggests that, in this scenario, Logistic Regression provides a more reliable risk prediction tool compared to the other models.

## Optimal Cutoff

The optimal cutoff is the threshold probability at which the classification model best balances sensitivity and specificity, maximizing overall performance. Here it is found with Youden's J statistic (J = sensitivity + specificity − 1), which is maximized over all candidate thresholds. Using this cutoff ensures that predictions are made with the most effective trade-off between correctly identifying positive cases and minimizing false positives, which is crucial for making reliable decisions in practical applications such as healthcare.


```{r}
# Find the optimal cutoff using Youden's J statistic
coords.optimal <- coords(roc.logit, x = "best", best.method = "youden", ret = c("threshold", "sensitivity", "specificity"))

# Print the optimal cutoff and associated sensitivity/specificity
print(coords.optimal)

```


Table 12. The optimal cutoff. 


The threshold value of 0.1563719 is the optimal cutoff, chosen to balance sensitivity and specificity. Using this threshold, the model classifies a person as at risk for coronary heart disease if their predicted probability exceeds 15.6%. A sensitivity of 63.2% indicates that the model is fairly good at identifying individuals who will develop coronary heart disease, although it misses about 36.8% of true positives (false negatives). The specificity of 70.3% indicates that the model does a better job of correctly identifying individuals who will not develop the disease, though about 29.7% of them are still flagged incorrectly (false positives).
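Applying the cutoff to turn predicted probabilities into class labels is a one-line comparison; the sketch below uses a hypothetical helper and toy probabilities with the threshold from Table 12:

```{r}
# Classify as "at risk" (1) when the predicted probability exceeds the cutoff
classify.at.cutoff <- function(prob, cutoff) {
  factor(as.integer(prob > cutoff), levels = c(0, 1))
}

# Toy probabilities for illustration
probs <- c(0.05, 0.10, 0.20, 0.45, 0.80)
classify.at.cutoff(probs, 0.1563719)   # 0 0 1 1 1

# In the analysis itself, the same call would be:
# pred.class <- classify.at.cutoff(pred.logit.prob, coords.optimal$threshold)
# confusionMatrix(pred.class, test.response)
```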


# CONCLUSION

The regression models evaluated, including Pruned Trees (both 1-SE and Minimum Cost-Complexity), Bagging, and Random Forest, provided valuable insights into predicting total cholesterol levels (totChol). Among these, the Random Forest model demonstrated the best performance with the lowest RMSE and highest R-squared value. The Pruned Tree (Minimum Cost-Complexity) model, while still performing well, showed a higher RMSE, suggesting that its predictions were less accurate. Given the nature of the task—predicting a continuous variable such as cholesterol levels—it is recommended to utilize Random Forest as the primary model for future predictions due to its higher accuracy and robustness. However, for interpretability and simplicity, Pruned Trees may also be considered when a more transparent model is needed.


For classification, the comparison of models such as Logistic Regression, CART (1-SE and Min), Bagging, and Random Forest for predicting coronary heart disease (TenYearCHD) revealed that Logistic Regression outperformed all other models, achieving the highest AUC. This suggests that Logistic Regression is particularly effective at distinguishing between individuals who will and will not develop coronary heart disease in the next ten years. While Random Forest and Bagging showed strong performance, Logistic Regression provided a more reliable and simpler model for this task. Therefore, it is recommended to prioritize Logistic Regression for future use in clinical applications where understanding and explaining the risk of coronary heart disease is critical. For improved performance in more complex datasets or non-linear relationships, Random Forest and Bagging can be considered as alternative models for further exploration.

